Sales Forecasting for Walmart in Mexico¶


Author: Daniel Eduardo López

LinkedIn | GitHub

11 Sep 2024

[Image: sales-figures-1473495_450p.jpg]

Image Credit: wagg66 from FreeImages.


Table of Contents¶


  1. Introduction
    1.1 Background
    1.2 General Objective
    1.3 Research Question
    1.4 Hypothesis
    1.5 Methodology
    1.6 Notebook's Goal
  2. Data Collection
  3. Data Exploration
    3.1 Data Description
    3.2 Data Quality
    3.3 Initial Data Analysis
  4. Data Preparation
    4.1 Data Selection
    4.2 Attributes Renaming
    4.3 Data Types Adjustment
    4.4 Missing Values Handling
    4.5 Derived Attributes Construction
    4.6 Data Aggregation
    4.7 Data Integration
    4.8 Predictors and Response Split
  5. Exploratory Data Analysis
    5.1 Statistical Measures
    5.2 Time Patterns
    5.3 Distributions and Relationships
    5.4 Correlations
    5.5 Differencing
    5.6 Stationarity
    5.7 Autocorrelation Function
    5.8 Decomposition
    5.9 Cointegration Test
  6. Data Modeling
    6.1 Moving Average (MA) Model
    6.2 Autoregressive (AR) Model
    6.3 Autoregressive (AR) Models with Additive Decomposition
    6.4 Autoregressive Moving Average (ARMA) Model
    6.5 Autoregressive Integrated Moving Average (ARIMA) Model
    6.6 Seasonal Autoregressive Integrated Moving Average (SARIMA) Model
    6.7 Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX) Model
    6.8 Simple Exponential Smoothing (SES) Model
    6.9 Holt-Winters (HW) Model
    6.10 Prophet Univariate Time Series Model
    6.11 Vector Autoregressive (VAR) Model
    6.12 Vector Autoregressive Moving Average (VARMA) Model
    6.13 Vector Autoregressive Integrated Moving Average (VARIMA) Model
    6.14 Random Forest (RF) Regression Model
    6.15 Support Vector Regression (SVR) Model
  7. Evaluation
    7.1 Models Ranking
    7.2 Best Univariate Time Series Model
    7.3 Best Multivariate Time Series Model
    7.4 Research Question's Solution
  8. Deployment
  9. Conclusions
  10. References


1. Introduction¶


1.1 Background ¶

Walmart of Mexico (or WALMEX) is one of the most important retail companies in the region, with 3,903 stores across Mexico and Central America, equity of 199,086,037 MXN, and a yearly revenue of 880,121,761 MXN, according to figures from December 2023. According to WALMEX's latest financial report, its goal is to double its sales within a period of 10 years (Wal-Mart de México S.A.B. de C.V., 2024).

Time series are "a set of data points ordered in time" (Peixeiro, 2022), which can be analyzed to calculate forecasts and get valuable insights (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Univariate time series analysis is the most widely used approach for analyzing time series (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023), by means of models such as Moving Average (MA), Autoregressive Moving Average (ARMA), Autoregressive Integrated Moving Average (ARIMA), or Simple Exponential Smoothing (SES), which depend solely on time and the variable under study.

On the other hand, it is also possible to forecast time series using regression-based modeling, in which other variables or features are used to predict the response variable (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). This approach has the advantage of quantifying the impact of external economic indicators on the performance of an organization.
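A hedged sketch of the regression-based approach is shown below; the indicator series, coefficients, and Random Forest settings are made-up assumptions for illustration only.

```python
# Regression-based forecasting sketch: predict a response (net sales) from
# external features (synthetic GDP and exchange-rate series) instead of from
# the response's own past values. All numbers here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
gdp = np.linspace(24e6, 26e6, 40) + rng.normal(0, 1e5, 40)  # fictitious GDP
fx_rate = 17 + rng.normal(0, 0.5, 40)                       # fictitious MXN/USD
X = np.column_stack([gdp, fx_rate])
y = 0.008 * gdp + 500 * fx_rate + rng.normal(0, 2000, 40)   # synthetic sales

# Train on the first 32 observations, predict the last 8 (chronological split)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:32], y[:32])
preds = model.predict(X[32:])
```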

In the case of Mexico, public data can be collected from government offices such as INEGI or BANXICO, or from international sources such as the S&P 500, to assess how such indicators correlate with revenue.

In this context, it is desirable to explore both approaches to predict WALMEX's net sales over the coming years. Thus, the purpose of the present project is to forecast WALMEX's net sales and then use that information to predict whether WALMEX will be able to achieve its long-term goal of doubling its sales within the next ten years.

1.2 General Objective ¶

To predict whether Walmart of Mexico will double its sales within the next ten years.

1.3 Research Question ¶

Will Walmart of Mexico be able to double its sales within the next ten years? If so, when?

1.4 Hypothesis ¶

Walmart de México will manage to double its sales within the next ten years.

1.5 Methodology ¶

The methodology of the present study is based on the CRISP-DM (Chapman et al., 2000) framework and Rollin’s Foundational Methodology for Data Science (Rollins, 2015):

  1. Analytical approach: Building and evaluation of univariate and multivariate time series and regression models.
  2. Data requirements: Data about WALMEX's net sales, WALMEX stock value, IPC index (performance of the largest and most liquid stocks listed on the Mexican Stock Exchange), S&P500 index, MXN/USD exchange rate, bonds interest rate (CETES28), money market representative interest rates (28 day TIIE), inflation, and gross domestic product (GDP) of Mexico.
  3. Data collection: Data from the last 10 years (from 01 Feb 2014 to 01 Feb 2024) was collected from Yahoo! Finance, WALMEX's investors website, INEGI, and Banxico.
  4. Data exploration: Data was explored with Python 3 and its libraries Numpy, Pandas, Matplotlib and Seaborn.
  5. Data preparation: Data was cleaned and prepared with Python 3 and its libraries Numpy and Pandas.
  6. Exploratory Data Analysis: Statistical measures, distributions, time trends, relationships, and correlations were assessed using Python 3 and its libraries Pandas, Matplotlib, and Seaborn.
  7. Data modeling: Ten univariate time series models were built and assessed in Python 3 and its libraries Statsmodels and Prophet to predict the net sales of WALMEX:

    • Moving Average (MA) model,
    • Autoregressive (AR) model,
    • a series of Autoregressive (AR) models with Additive Decomposition,
    • Autoregressive Moving Average (ARMA) model,
    • Autoregressive Integrated Moving Average (ARIMA) model,
    • Seasonal Autoregressive Integrated Moving Average (SARIMA) model,
    • Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX) model,
    • Simple Exponential Smoothing (SES) model,
    • Holt-Winters (HW) model, and
    • Prophet Univariate Time Series Modeling.

      Then, three vector models were created and trained in Python 3 and its libraries Statsmodels and Darts to predict the values of the selected macroeconomic indicators as a multivariate time series:

    • Vector Autoregressive (VAR) model,
    • Vector Autoregressive Moving Average (VARMA) model, and
    • Vector Autoregressive Integrated Moving Average (VARIMA) model.

      After that, two regression models were built using Random Forests and Support Vector Machines in Python 3 and its library Scikit-learn to predict WALMEX net sales based on the predictions from the best performing multivariate time series model.

      All the models were fit using a training set with 80% of the data and assessed using a testing set with the remaining 20% of the data. The Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), and the Coefficient of Determination $(r^{2})$ were used for model assessment.

  8. Evaluation: The different models were ranked in terms of the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), and the Coefficient of Determination $(r^{2})$. The best univariate, multivariate, and regression models were selected, retrained with all the historical data, and used to predict whether Walmart of Mexico would be able to double its sales within the next ten years.

  9. Deployment: The best forecasting model was deployed using Streamlit and Plotly.
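The evaluation scheme in steps 7 and 8 above can be sketched as follows; the synthetic series and the naive baseline forecast are stand-ins for the actual models, used here only to show the split and the scores.

```python
# Chronological 80/20 split (no shuffling, to avoid leaking future data into
# training) and the three scores used to rank the models.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y = np.linspace(100, 250, 40) + np.random.default_rng(0).normal(0, 5, 40)

split = int(len(y) * 0.8)                   # first 80% for training
y_train, y_test = y[:split], y[split:]      # last 20% held out for testing

y_pred = np.full_like(y_test, y_train[-1])  # naive last-value baseline forecast

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  r2={r2:.2f}")
```

Keeping the split chronological rather than random is the key design choice here: shuffling would let the models train on observations that occur after the test period.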

1.6 Notebook's Goal ¶

In this context, the purpose of the present notebook is to collect, explore, prepare, model, and evaluate the data from the different sources.

In [539]:
# Loading Requirements Text File
# %pip install -r requirements.txt
In [540]:
# Libraries installation
# %pip install ipykernel
# %pip install numpy
# %pip install pandas
# %pip install yfinance
# %pip install openpyxl
# %pip install matplotlib
# %pip install seaborn
# %pip install statsmodels
# %pip install -U scikit-learn
# %pip install prophet
# %pip install darts
In [541]:
# Libraries importation

import json
import warnings
from itertools import product
from itertools import combinations_with_replacement
from urllib.request import urlopen
from typing import List, Tuple

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
from matplotlib.lines import Line2D
import seaborn as sns
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.api import SimpleExpSmoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from prophet import Prophet
from prophet.diagnostics import cross_validation
from prophet.diagnostics import performance_metrics
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.statespace.varmax import VARMAX
from darts.models.forecasting.varima import VARIMA
from darts import TimeSeries
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
In [542]:
# Setting theme and plot resolution
sns.set_theme(context = 'notebook', style = 'darkgrid')
mpl.rcParams["figure.dpi"] = 100
mpl.rcParams["savefig.dpi"] = 300
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

# Setting default plot's aesthetics
plotfontcolor = 'dimgray'
mpl.rcParams['text.color'] = plotfontcolor
mpl.rcParams['axes.labelcolor'] = plotfontcolor
mpl.rcParams['xtick.color'] = plotfontcolor
mpl.rcParams['ytick.color'] = plotfontcolor
mpl.rcParams["font.size"] = 10
mpl.rcParams['axes.titlesize'] = 14
mpl.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 11
mpl.rcParams["axes.labelweight"] = "bold"
mpl.rcParams["axes.titleweight"] = "bold"
#mpl.rcParams["legend.facecolor"] = "lightgray"
#mpl.rcParams["legend.edgecolor"] = "gray"
#mpl.rcParams["legend.framealpha"] = 0.8
#mpl.rcParams['font.family'] = 'sans-serif'
#mpl.rcParams['font.family'] = 'serif'

# Disabling warnings
warnings.filterwarnings('ignore')

# Color for plots
time_series_color = sns.color_palette('Blues_r')[0]
pred_color = '#C70039'


2. Data Collection¶


Firstly, WALMEX's sales figures were retrieved from its investors website. As the financial data was disclosed on a quarterly basis in 40 PDF files hosted at a variety of inconsistent links, for the sake of efficiency, it was decided to collect the data manually and consolidate it into an Excel file.

Although the number of files was sizeable for manual handling, the complexity of developing a script to scrape and parse each file was too high just to retrieve the Net sales account. Thus, it was decided to proceed manually.

It's important to note that WALMEX's net sales figures are in millions of Mexican pesos (mdp, for its acronym in Spanish).

In [543]:
# Loading sales data 
repo_link = 'https://github.com/DanielEduardoLopez/SalesPrediction/raw/main/Walmex_Quarterly_Net_Sales.xlsx'
sales_df = pd.read_excel(repo_link)
sales_df.head()
Out[543]:
Year Quarter Net Sales (mdp) Units Sq.mt. Mexico Sq.mt. Central America
0 2014 Q1 101405 2867 NaN NaN
1 2014 Q2 103300 2879 NaN NaN
2 2014 Q3 104367 2904 NaN NaN
3 2014 Q4 128586 2983 NaN NaN
4 2015 Q1 110875 2987 NaN NaN

Then, the stock close values of WALMEX and the index values of the IPC and the S&P 500 were retrieved from the Yahoo! Finance API through the yfinance library.

In [544]:
# Retrieving market data for WALMEX, IPC and S&P500
stocks = ["WALMEX.MX", "^MXX", "^GSPC"]
stocks_df = yf.download(stocks, start="2014-02-01", end="2024-02-01")['Close']
stocks_df.head()
[*********************100%%**********************]  3 of 3 completed
Out[544]:
Ticker WALMEX.MX ^GSPC ^MXX
Date
2014-02-03 NaN 1741.890015 NaN
2014-02-04 31.500000 1755.199951 40085.519531
2014-02-05 31.379999 1751.640015 39880.871094
2014-02-06 31.190001 1773.430054 40288.781250
2014-02-07 29.850000 1797.020020 40525.738281

The currency of the WALMEX stock value is Mexican pesos (MXN).

Later, the GDP and inflation data were retrieved from INEGI's Query Constructor by saving the response JSON files to disk and then loading them into the notebook.

It's important to note that GDP values are in millions of Mexican pesos at 2018 prices.

In [545]:
# Loading GDP data
gdp_link = "https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/gdp_data.json"

response = urlopen(gdp_link) 

gdp_json = json.loads(response.read())
In [546]:
# Retrieving series
gdp_dict = gdp_json["Series"][0]["OBSERVATIONS"]
In [547]:
# Converting dict into a dataframe
gdp_df = pd.DataFrame.from_dict(gdp_dict, orient='columns')
gdp_df.head()
Out[547]:
TIME_PERIOD OBS_VALUE OBS_EXCEPTION OBS_STATUS OBS_SOURCE OBS_NOTE COBER_GEO
0 2023/04 25596360.563 1 17 None 00
1 2023/03 25121269.269 17 None 00
2 2023/02 25001939.978 17 None 00
3 2023/01 24291956.149 17 None 00
4 2022/04 24981146.477 17 None 00

It is noteworthy that the GDP values were sorted in descending order by time period, which is inconsistent with the other datasets; thus, the GDP dataset was sorted in ascending order.

In [548]:
gdp_df = gdp_df.sort_values(by='TIME_PERIOD', ascending=True).reset_index(drop=True)
gdp_df.head()
Out[548]:
TIME_PERIOD OBS_VALUE OBS_EXCEPTION OBS_STATUS OBS_SOURCE OBS_NOTE COBER_GEO
0 1980/01 10401367.607 17 None 00
1 1980/02 10342350 17 None 00
2 1980/03 10392733.012 17 None 00
3 1980/04 10927666.353 17 None 00
4 1981/01 11345848.491 17 None 00

Finally, the MXN/USD exchange rates, the bonds interest rates (CETES 28), and the money market representative interest rates (28 day TIIE) data were retrieved from Banxico's website in the form of CSV files.

In [549]:
# Loading MXN/USD exchange rates data
exchange_rates_link = 'https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/exchange_rates_data.csv'
exchange_rates_df = pd.read_csv(exchange_rates_link, encoding = "ISO-8859-1")
exchange_rates_df.head(15)
Out[549]:
Banco de México Unnamed: 1
0 NaN NaN
1 Exchange rates and auctions historical informa... NaN
2 Tipos de cambio diarios NaN
3 NaN NaN
4 Date: 03/02/2024 06:14:08 NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Title Exchange rate pesos per US dollar, Used to set...
9 Information type Levels
10 Date SF60653
11 01/01/2014 13.0652
12 01/02/2014 13.0652
13 01/03/2014 13.0843
14 01/04/2014 13.1011
In [550]:
# Loading bonds interest rates data
bonds_rates_link = 'https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/bonds_rates_data.csv'
bonds_rates_df = pd.read_csv(bonds_rates_link, encoding = "ISO-8859-1")
bonds_rates_df.head(15)
Out[550]:
Banco de México Unnamed: 1
0 NaN NaN
1 Securities auctions and Open Market Operations NaN
2 Valores Gubernamentales NaN
3 NaN NaN
4 Date: 03/02/2024 06:05:53 NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Title Bonds Issued by Public Agencies Weekly auction...
9 Information type Levels
10 Date SF43936
11 01/02/2014 3.16
12 01/09/2014 3.17
13 01/16/2014 3.05
14 01/23/2014 3.13
In [551]:
# Loading money market interest rates data
interest_rates_link = 'https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/interest_rates_data.csv'
interest_rates_df = pd.read_csv(interest_rates_link, encoding = "ISO-8859-1")
interest_rates_df.head(15)
Out[551]:
Banco de México Unnamed: 1
0 NaN NaN
1 Securities prices and interest rates NaN
2 Tasas de Interés en el Mercado de Dinero NaN
3 NaN NaN
4 Date: 03/02/2024 06:10:39 NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Title 28 day TIIE, Interest rate in annual percent
9 Information type Levels
10 Date SF43783
11 01/02/2014 3.795
12 01/03/2014 3.795
13 01/06/2014 3.7925
14 01/07/2014 3.7962


3. Data Exploration¶


In this section, the collected datasets were explored to describe their attributes, number of records, shapes, and data types; to assess their quality in terms of the percentage of missing values and suspected extreme outliers; and to perform an initial exploration of statistical measures, time trends, distributions, and relationships among variables.

3.1 Data Description ¶

Sales Dataset¶

In [552]:
# Preview of the dataset
sales_df.head()
Out[552]:
Year Quarter Net Sales (mdp) Units Sq.mt. Mexico Sq.mt. Central America
0 2014 Q1 101405 2867 NaN NaN
1 2014 Q2 103300 2879 NaN NaN
2 2014 Q3 104367 2904 NaN NaN
3 2014 Q4 128586 2983 NaN NaN
4 2015 Q1 110875 2987 NaN NaN
In [553]:
sales_df.tail()
Out[553]:
Year Quarter Net Sales (mdp) Units Sq.mt. Mexico Sq.mt. Central America
35 2022 Q4 236272 3745 6655421.0 815407.0
36 2023 Q1 204601 3755 6659477.0 818300.0
37 2023 Q2 212164 3775 6688133.0 818300.0
38 2023 Q3 211436 3802 6706432.0 819264.0
39 2023 Q4 251921 3903 6831809.0 821822.0
In [554]:
# Basic info of the dataset
sales_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Year                    40 non-null     int64  
 1   Quarter                 40 non-null     object 
 2   Net Sales (mdp)         40 non-null     int64  
 3   Units                   40 non-null     int64  
 4   Sq.mt. Mexico           27 non-null     float64
 5   Sq.mt. Central America  27 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 2.0+ KB
In [555]:
# Basic statistical description
sales_df.describe()
Out[555]:
Year Net Sales (mdp) Units Sq.mt. Mexico Sq.mt. Central America
count 40.000000 40.000000 40.000000 2.700000e+01 27.000000
mean 2018.500000 160831.950000 3311.700000 6.393321e+06 780801.370370
std 2.908872 37308.578943 300.161512 2.190696e+05 39569.845038
min 2014.000000 101405.000000 2867.000000 6.035180e+06 689820.000000
25% 2016.000000 130929.000000 3041.500000 6.203962e+06 753962.000000
50% 2018.500000 160334.500000 3254.000000 6.403496e+06 799486.000000
75% 2021.000000 188321.250000 3545.750000 6.553819e+06 807843.500000
max 2023.000000 251921.000000 3903.000000 6.831809e+06 821822.000000
In [556]:
# Shape of dataset
sales_df.shape
Out[556]:
(40, 6)

Stock Dataset¶

In [557]:
# Preview of dataset
stocks_df.head()
Out[557]:
Ticker WALMEX.MX ^GSPC ^MXX
Date
2014-02-03 NaN 1741.890015 NaN
2014-02-04 31.500000 1755.199951 40085.519531
2014-02-05 31.379999 1751.640015 39880.871094
2014-02-06 31.190001 1773.430054 40288.781250
2014-02-07 29.850000 1797.020020 40525.738281
In [558]:
stocks_df.tail()
Out[558]:
Ticker WALMEX.MX ^GSPC ^MXX
Date
2024-01-25 69.279999 4894.160156 56160.070312
2024-01-26 70.699997 4890.970215 56855.878906
2024-01-29 70.349998 4927.930176 57175.730469
2024-01-30 71.919998 4924.970215 57537.140625
2024-01-31 71.089996 4845.649902 57372.761719
In [559]:
# Basic info
stocks_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2584 entries, 2014-02-03 to 2024-01-31
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   WALMEX.MX  2515 non-null   float64
 1   ^GSPC      2516 non-null   float64
 2   ^MXX       2514 non-null   float64
dtypes: float64(3)
memory usage: 80.8 KB
In [560]:
# Basic statistical description
stocks_df.describe()
Out[560]:
Ticker WALMEX.MX ^GSPC ^MXX
count 2515.000000 2516.000000 2514.000000
mean 52.796288 3030.773772 46462.182669
std 13.482746 909.474790 4687.222880
min 28.059999 1741.890015 32964.218750
25% 42.039999 2138.630066 43406.756836
50% 52.689999 2810.109985 46047.953125
75% 65.530003 3930.392578 49939.346680
max 81.919998 4927.930176 57745.789062
In [561]:
# Shape of dataset
stocks_df.shape
Out[561]:
(2584, 3)

So, the stock dataset comprises 2,584 daily observations of the close values of WALMEX.MX, ^GSPC (S&P 500), and ^MXX (IPC).
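Since the sales figures are quarterly while the stock data is daily, the frequencies will eventually need to be aligned. One plausible approach (an assumption here; the actual aggregation is performed later, in the Data Preparation section) is resampling the daily closes to quarterly means:

```python
# Resample a synthetic daily close-price series to quarterly means so that it
# could be aligned with quarterly net sales figures. Values are made up.
import numpy as np
import pandas as pd

daily = pd.Series(
    np.linspace(31.5, 71.0, 2584),
    index=pd.date_range("2014-02-03", periods=2584, freq="B"),  # business days
    name="WALMEX.MX",
)
quarterly = daily.resample("QS").mean()  # one mean close per quarter
print(quarterly.head())
```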

GDP Dataset¶

In [562]:
# Preview of the dataset
gdp_df.head()
Out[562]:
TIME_PERIOD OBS_VALUE OBS_EXCEPTION OBS_STATUS OBS_SOURCE OBS_NOTE COBER_GEO
0 1980/01 10401367.607 17 None 00
1 1980/02 10342350 17 None 00
2 1980/03 10392733.012 17 None 00
3 1980/04 10927666.353 17 None 00
4 1981/01 11345848.491 17 None 00
In [563]:
gdp_df.tail()
Out[563]:
TIME_PERIOD OBS_VALUE OBS_EXCEPTION OBS_STATUS OBS_SOURCE OBS_NOTE COBER_GEO
171 2022/04 24981146.477 17 None 00
172 2023/01 24291956.149 17 None 00
173 2023/02 25001939.978 17 None 00
174 2023/03 25121269.269 17 None 00
175 2023/04 25596360.563 1 17 None 00
In [564]:
# Basic info of the dataset
gdp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176 entries, 0 to 175
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   TIME_PERIOD    176 non-null    object
 1   OBS_VALUE      176 non-null    object
 2   OBS_EXCEPTION  176 non-null    object
 3   OBS_STATUS     176 non-null    object
 4   OBS_SOURCE     176 non-null    object
 5   OBS_NOTE       0 non-null      object
 6   COBER_GEO      176 non-null    object
dtypes: object(7)
memory usage: 9.8+ KB
In [565]:
# Basic statistical description
gdp_df.describe()
Out[565]:
TIME_PERIOD OBS_VALUE OBS_EXCEPTION OBS_STATUS OBS_SOURCE OBS_NOTE COBER_GEO
count 176 176 176 176 176 0 176
unique 176 176 1 3 1 0 1
top 1980/01 10401367.607 17 NaN 00
freq 1 1 176 174 176 NaN 176
In [566]:
# Shape of dataset
gdp_df.shape
Out[566]:
(176, 7)

Exchange Rates Dataset¶

In [567]:
# Preview of the dataset
exchange_rates_df.head(10)
Out[567]:
Banco de México Unnamed: 1
0 NaN NaN
1 Exchange rates and auctions historical informa... NaN
2 Tipos de cambio diarios NaN
3 NaN NaN
4 Date: 03/02/2024 06:14:08 NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Title Exchange rate pesos per US dollar, Used to set...
9 Information type Levels
In [568]:
# Preview of the dataset
exchange_rates_df.tail()
Out[568]:
Banco de México Unnamed: 1
3723 03/01/2024 17.0962
3724 03/02/2024 17.0633
3725 03/03/2024 17.0633
3726 03/04/2024 17.0633
3727 03/05/2024 17.0217
In [569]:
# Basic info of the dataset
exchange_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3728 entries, 0 to 3727
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Banco de México  3723 non-null   object
 1   Unnamed: 1       3720 non-null   object
dtypes: object(2)
memory usage: 58.4+ KB
In [570]:
# Basic statistical description
exchange_rates_df.describe()
Out[570]:
Banco de México Unnamed: 1
count 3723 3720
unique 3723 2494
top Exchange rates and auctions historical informa... 20.1175
freq 1 7
In [571]:
# Shape of dataset
exchange_rates_df.shape
Out[571]:
(3728, 2)

Bonds Interest Rates Dataset¶

In [572]:
# Preview of the dataset
bonds_rates_df.head(10)
Out[572]:
Banco de México Unnamed: 1
0 NaN NaN
1 Securities auctions and Open Market Operations NaN
2 Valores Gubernamentales NaN
3 NaN NaN
4 Date: 03/02/2024 06:05:53 NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Title Bonds Issued by Public Agencies Weekly auction...
9 Information type Levels
In [573]:
bonds_rates_df.tail()
Out[573]:
Banco de México Unnamed: 1
537 02/01/2024 11.15
538 02/08/2024 11.06
539 02/15/2024 11.05
540 02/22/2024 11
541 02/29/2024 11
In [574]:
# Basic info of the dataset
bonds_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 542 entries, 0 to 541
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Banco de México  537 non-null    object
 1   Unnamed: 1       534 non-null    object
dtypes: object(2)
memory usage: 8.6+ KB
In [575]:
bonds_rates_df.describe()
Out[575]:
Banco de México Unnamed: 1
count 537 534
unique 537 301
top Securities auctions and Open Market Operations 3.05
freq 1 7
In [576]:
# Shape of dataset
bonds_rates_df.shape
Out[576]:
(542, 2)

Money Market Interest Rates Dataset¶

In [577]:
# Preview of the dataset
interest_rates_df.head(10)
Out[577]:
Banco de México Unnamed: 1
0 NaN NaN
1 Securities prices and interest rates NaN
2 Tasas de Interés en el Mercado de Dinero NaN
3 NaN NaN
4 Date: 03/02/2024 06:10:39 NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Title 28 day TIIE, Interest rate in annual percent
9 Information type Levels
In [578]:
interest_rates_df.tail()
Out[578]:
Banco de México Unnamed: 1
2565 02/27/2024 11.4891
2566 02/28/2024 11.49
2567 02/29/2024 11.4875
2568 03/01/2024 11.4937
2569 03/04/2024 11.485
In [579]:
# Basic info of the dataset
interest_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2570 entries, 0 to 2569
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Banco de México  2565 non-null   object
 1   Unnamed: 1       2562 non-null   object
dtypes: object(2)
memory usage: 40.3+ KB
In [580]:
interest_rates_df.describe()
Out[580]:
Banco de México Unnamed: 1
count 2565 2562
unique 2565 1730
top Securities prices and interest rates 3.3
freq 1 31
In [581]:
# Shape of dataset
interest_rates_df.shape
Out[581]:
(2570, 2)

3.2 Data Quality ¶

Missing values ¶

Missing data is a common issue in real datasets. Thus, in the present section, the datasets were assessed to identify the number of missing values and their percentage.

In [582]:
# Function to calculate the percentage of missing values for each column in the dataset

def get_missing_values_summary(df: pd.DataFrame) -> pd.DataFrame:
        """
        Calculates the count and rate of missing values for each attribute in the dataset.

        Parameters:
        df (pandas.DataFrame): Input dataset.

        Returns
        missing_values_summary_table (pandas.DataFrame): Table with the count and rate of missing values for each attribute.

        """        
        missing_values_count = df.isnull().sum()

        missing_values_rate = (missing_values_count / len(df)) * 100

        missing_values_summary_table = pd.concat([pd.Series(missing_values_count.index), 
                                                  pd.Series(missing_values_count.values), 
                                                  pd.Series(missing_values_rate.values)], axis=1)
        
        missing_values_summary_table.columns = ['Attribute', 'Missing Values Count', 'Missing Values Rate (%)']

        missing_values_summary_table = missing_values_summary_table[missing_values_summary_table.iloc[:,1] != 0].sort_values(
                                            'Missing Values Rate (%)', ascending=False).round(2)
        
        print ("Dataset has " + str(df.shape[1]) + " attributes.\n"      
            "There are " + str(missing_values_summary_table.shape[0]) + " attributes that have missing values.")
        
        return missing_values_summary_table
In [583]:
# Missing data in sales dataset
get_missing_values_summary(sales_df)
Dataset has 6 attributes.
There are 2 attributes that have missing values.
Out[583]:
Attribute Missing Values Count Missing Values Rate (%)
4 Sq.mt. Mexico 13 32.5
5 Sq.mt. Central America 13 32.5
In [584]:
# Missing data in stocks dataset
get_missing_values_summary(stocks_df)
Dataset has 3 attributes.
There are 3 attributes that have missing values.
Out[584]:
Attribute Missing Values Count Missing Values Rate (%)
2 ^MXX 70 2.71
0 WALMEX.MX 69 2.67
1 ^GSPC 68 2.63
In [585]:
# Missing data in gdp dataset
get_missing_values_summary(gdp_df)
Dataset has 7 attributes.
There are 1 attributes that have missing values.
Out[585]:
Attribute Missing Values Count Missing Values Rate (%)
5 OBS_NOTE 176 100.0
In [586]:
# Missing data in exchange rates dataset
get_missing_values_summary(exchange_rates_df)
Dataset has 2 attributes.
There are 2 attributes that have missing values.
Out[586]:
Attribute Missing Values Count Missing Values Rate (%)
1 Unnamed: 1 8 0.21
0 Banco de México 5 0.13
In [587]:
# Missing data in bonds rates dataset
get_missing_values_summary(bonds_rates_df)
Dataset has 2 attributes.
There are 2 attributes that have missing values.
Out[587]:
Attribute Missing Values Count Missing Values Rate (%)
1 Unnamed: 1 8 1.48
0 Banco de México 5 0.92
In [588]:
# Missing data in interest rates dataset
get_missing_values_summary(interest_rates_df)
Dataset has 2 attributes.
There are 2 attributes that have missing values.
Out[588]:
Attribute Missing Values Count Missing Values Rate (%)
1 Unnamed: 1 8 0.31
0 Banco de México 5 0.19

So, several attributes in the datasets have missing values, the vast majority to a small extent, for which the method dropna() was sufficient.

However, the attributes OBS_NOTE in the gdp dataset, and Sq.mt. Mexico and Sq.mt. Central America in the sales dataset, have missing value rates higher than 30%, so they will be removed during the preparation step.
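The handling strategy described above can be sketched as follows; the toy frame and the 30% threshold are illustrative assumptions standing in for the actual preparation code:

```python
# Drop columns whose missing-value rate exceeds a threshold, then drop the
# remaining rows with missing values. Toy data for illustration only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "net_sales": [101405, 103300, np.nan, 128586],
    "mostly_missing": [np.nan, np.nan, np.nan, 815407.0],
})

threshold = 0.30
keep = df.columns[df.isnull().mean() <= threshold]  # column-wise missing rate
cleaned = df[keep].dropna()                         # drop remaining NaN rows
print(cleaned)
```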

Outliers ¶

Likewise, the datasets were assessed to identify any extreme outliers according to the interquartile range criterion (NIST/SEMATECH, 2012).

In [589]:
# Function to calculate outliers based on the interquartile range
def count_outliers(series: pd.Series) -> Tuple[int, float]:
        """ 
        Returns the number of suspected extreme outliers and its rate based on the rule of: +-interquartile range * 3 (NIST/SEMATECH, 2012).

        Parameters:
        series (pandas.Series): Vector of numerical data.

        Returns:
        outliers_count (int): Number of suspected extreme outliers in the input data.
        outliers_percentage (float): Rate of suspected extreme outliers in the input data.

        """
        data_without_missing_values = series.dropna()

        q3, q1 = np.percentile(data_without_missing_values, [75, 25])

        iqr = q3 - q1

        lower_limit = q1 - iqr * 3
        upper_limit = q3 + iqr * 3

        mask = (data_without_missing_values > lower_limit) & (data_without_missing_values < upper_limit)

        data_count = data_without_missing_values.shape[0]

        outliers_count = data_count - data_without_missing_values.loc[mask].shape[0]

        outliers_percentage = round((outliers_count / data_count) * 100, 2)
        
        return outliers_count, outliers_percentage
In [590]:
def get_outliers_summary(df: pd.DataFrame) -> pd.DataFrame:
        """ 
        Calculates the count and rate of outliers for each numeric column in the dataset.

        Parameters:
        df (pandas.DataFrame): Input dataset.

        Returns:
        outliers_summary_table (pandas.DataFrame): Table with the number of suspected outliers and its corresponding rate for each column.

        """

        outliers_count_list = []

        outliers_percentage_list = []

        dataset_columns = list(df.select_dtypes(include='number').columns)

        for dataset_column in dataset_columns:

            outliers_count, outliers_percentage = count_outliers(df[dataset_column])

            outliers_count_list.append(outliers_count)
            outliers_percentage_list.append(outliers_percentage)
        
        outliers_dict = {'Attribute': dataset_columns, 
                        'Outliers Count':outliers_count_list, 
                        'Outliers Rate (%)': outliers_percentage_list}
        
        outliers_summary_table = pd.DataFrame(outliers_dict)    

        outliers_summary_table = (outliers_summary_table.loc[outliers_summary_table['Outliers Count']>0,:]
                                .sort_values(by='Outliers Count', ascending=False))

        print ("Dataset has " + str(df.shape[1]) + " columns.\n"      
        "There are " + str(outliers_summary_table.shape[0]) + " attributes that have suspected extreme outliers.")

        return outliers_summary_table
In [591]:
# Suspected outliers in sales dataset
get_outliers_summary(sales_df)
Dataset has 6 columns.
There are 0 attributes that have suspected extreme outliers.
Out[591]:
Attribute Outliers Count Outliers Rate (%)
In [592]:
# Suspected outliers in stocks dataset
get_outliers_summary(stocks_df)
Dataset has 3 columns.
There are 0 attributes that have suspected extreme outliers.
Out[592]:
Attribute Outliers Count Outliers Rate (%)
In [593]:
# Suspected outliers in GDP dataset
get_outliers_summary(gdp_df)
Dataset has 7 columns.
There are 0 attributes that have suspected extreme outliers.
Out[593]:
Attribute Outliers Count Outliers Rate (%)
In [594]:
# Suspected outliers in exchange rates dataset
get_outliers_summary(exchange_rates_df)
Dataset has 2 columns.
There are 0 attributes that have suspected extreme outliers.
Out[594]:
Attribute Outliers Count Outliers Rate (%)
In [595]:
# Suspected outliers in bonds rates dataset
get_outliers_summary(bonds_rates_df)
Dataset has 2 columns.
There are 0 attributes that have suspected extreme outliers.
Out[595]:
Attribute Outliers Count Outliers Rate (%)
In [596]:
# Suspected outliers in interest rates dataset
get_outliers_summary(interest_rates_df)
Dataset has 2 columns.
There are 0 attributes that have suspected extreme outliers.
Out[596]:
Attribute Outliers Count Outliers Rate (%)

Thus, the datasets are free from extreme outliers.
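As a quick sanity check of the 3×IQR rule used above, the logic of count_outliers can be exercised on synthetic data with one injected extreme value. This is a minimal, self-contained sketch (the function body is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def count_outliers(series: pd.Series) -> tuple:
    """Flag values outside [Q1 - 3*IQR, Q3 + 3*IQR], as in the notebook."""
    data = series.dropna()
    q3, q1 = np.percentile(data, [75, 25])
    iqr = q3 - q1
    lower_limit = q1 - iqr * 3
    upper_limit = q3 + iqr * 3
    mask = (data > lower_limit) & (data < upper_limit)
    outliers_count = data.shape[0] - data.loc[mask].shape[0]
    outliers_percentage = round((outliers_count / data.shape[0]) * 100, 2)
    return outliers_count, outliers_percentage

# 99 well-behaved values plus one injected extreme value
rng = np.random.default_rng(42)
s = pd.Series(np.append(rng.random(99), 1000.0))

count, pct = count_outliers(s)
print(count, pct)  # → 1 1.0
```

The injected value falls far above Q3 + 3×IQR and is the only observation flagged, which is the behavior relied on when concluding the datasets are outlier-free.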

3.3 Initial Data Analysis ¶

In this section, the data was initially analyzed to calculate simple statistical measures, identify time trends, and explore distributions and relationships for each data source.

Statistical measures¶

The basic statistical measures of the sales dataset are shown below:

In [597]:
sales_df.drop(columns=['Year', 'Quarter']).describe()
Out[597]:
Net Sales (mdp) Units Sq.mt. Mexico Sq.mt. Central America
count 40.000000 40.000000 2.700000e+01 27.000000
mean 160831.950000 3311.700000 6.393321e+06 780801.370370
std 37308.578943 300.161512 2.190696e+05 39569.845038
min 101405.000000 2867.000000 6.035180e+06 689820.000000
25% 130929.000000 3041.500000 6.203962e+06 753962.000000
50% 160334.500000 3254.000000 6.403496e+06 799486.000000
75% 188321.250000 3545.750000 6.553819e+06 807843.500000
max 251921.000000 3903.000000 6.831809e+06 821822.000000

On the other hand, the basic statistical measures of the stocks dataset are as follows:

In [598]:
stocks_df.describe()
Out[598]:
Ticker WALMEX.MX ^GSPC ^MXX
count 2515.000000 2516.000000 2514.000000
mean 52.796288 3030.773772 46462.182669
std 13.482746 909.474790 4687.222880
min 28.059999 1741.890015 32964.218750
25% 42.039999 2138.630066 43406.756836
50% 52.689999 2810.109985 46047.953125
75% 65.530003 3930.392578 49939.346680
max 81.919998 4927.930176 57745.789062

The statistical measures of the GDP dataset are shown below:

In [599]:
gdp_df[['OBS_VALUE']].astype(float).describe()
Out[599]:
OBS_VALUE
count 1.760000e+02
mean 1.739801e+07
std 4.650498e+06
min 1.034235e+07
25% 1.270458e+07
50% 1.783120e+07
75% 2.148982e+07
max 2.559636e+07

Finally, the statistical measures of the exchange rates, bonds rates, and interest rates datasets are shown below:

In [600]:
exchange_rates_df.iloc[11:,1].astype(float).to_frame().describe()
Out[600]:
Unnamed: 1
count 3717.000000
mean 18.466568
std 2.395290
min 12.846200
25% 17.295700
50% 18.938500
75% 19.991300
max 25.118500
In [601]:
bonds_rates_df.iloc[11:,1].astype(float).to_frame().describe()
Out[601]:
Unnamed: 1
count 531.000000
mean 6.160000
std 2.591843
min 2.430000
25% 4.015000
50% 6.240000
75% 7.740000
max 11.400000
In [602]:
interest_rates_df.iloc[11:,1].astype(float).to_frame().describe()
Out[602]:
Unnamed: 1
count 2559.000000
mean 6.515711
std 2.587540
min 3.274100
25% 4.281500
50% 6.600000
75% 8.115950
max 11.566900

Time patterns¶

In [603]:
def plot_linechart(df: pd.DataFrame, is_saved: bool = True) -> None:
        """
        Function to plot a line chart for each attribute in the dataset.

        Parameters:        
        df (pandas.DataFrame): Input dataset.
        is_saved (bool): Indicates whether the plot should be saved on local disk.

        Returns:
        None
        """
        
        columns = list(df.columns)

        for column in columns:

            plt.subplots(figsize=(7,5))
            
            sns.lineplot(data=df,
                        x=df.index,
                        y=df[column],             
                        legend=False
                        )
            
            xlabel=df.index.name.title()
            ylabel=df[column].name.title()

            y_ticks = plt.gca().get_yticks()
            plt.gca().set_yticklabels([f'{x:,.0f}' for x in y_ticks])

            plt.xlabel(xlabel)
            plt.ylabel(ylabel)
            plt.title(f'{ylabel} by {xlabel}')            
            # plt.xticks([])

            if is_saved:
                plt.savefig(f'Images/fig_{ylabel.lower().replace("^", "").replace(".", "").replace(" ", "_")}_over_time.png',  
                            bbox_inches = 'tight')
            

The time trends of the sales dataset are shown below:

In [604]:
plot_linechart(sales_df.drop(columns=['Year', 'Quarter']).rename_axis('Date'))

The net sales of WALMEX exhibit a positive trend with seasonality. Units and the commercial area in both Mexico and Central America also show a positive trend.

The time trends of the stocks dataset are plotted below:

In [605]:
plot_linechart(stocks_df)

The stock values of WALMEX and the S&P 500 show a strong positive trend, whereas the Mexican IPC shows a weak one.

Moreover, the time trend of the GDP dataset is shown below:

In [606]:
chart_title = 'GDP Over Time'
plot_linechart(gdp_df[['OBS_VALUE']].astype(float).rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('GDP (million of MXN)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

The Mexican GDP shows a positive trend over the last ten years, with a sharp but temporary decrease during the COVID-19 pandemic.

The time trend of the exchange rates dataset is shown below:

In [607]:
chart_title = 'Exchange Rates Over Time'
plot_linechart(exchange_rates_df.iloc[11:,1].astype(float).to_frame().rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('Exchange Rate (MXN per USD)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

The exchange rate showed a weak positive trend over the last ten years.

The time trend of the bonds rates dataset is shown below:

In [608]:
chart_title = 'Bond Rates Over Time'
plot_linechart(bonds_rates_df.iloc[11:,1].astype(float).to_frame().rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('Bond Rates (%)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

The time trend of the interest rates dataset is shown below:

In [609]:
chart_title = 'Interest Rates Over Time'
plot_linechart(interest_rates_df.iloc[11:,1].astype(float).to_frame().rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('Interest Rates (%)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

Both bond rates and money market interest rates show no consistent trend, as those rates are set by the Central Bank of Mexico according to its contractionary monetary policy.

Distributions¶

The distributions of the variables in the sales dataset are shown below by means of histograms:

In [610]:
sns.histplot(sales_df['Net Sales (mdp)'])
plt.title('Distribution of Walmex Net Sales')
plt.xlabel('Net Sales (million of MXN)')
plt.ylabel('Count')
plt.show()

The net sales of WALMEX show a roughly normal distribution, somewhat skewed to the right.
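The visual impression of right skew can be quantified with pandas' .skew() (Fisher-Pearson coefficient, where a positive value indicates a longer right tail). Below is a sketch on synthetic lognormal data; on the actual series the call would simply be sales_df['Net Sales (mdp)'].skew():

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (lognormal), standing in for the net sales series
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=5000))

# A positive skewness coefficient confirms a longer right tail
skewness = sample.skew()
print(skewness > 0)  # → True
```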

The distributions of the variables in the stocks dataset are shown below by means of histograms:

In [611]:
sns.histplot(stocks_df['WALMEX.MX'])
plt.title('Distribution of WALMEX Stock Price')
plt.xlabel('Stock Price (MXN)')
plt.ylabel('Count')
plt.show()

The stock value of WALMEX seems to show three different distributions.

In [612]:
sns.histplot(stocks_df['^GSPC'])
plt.title('Distribution of S&P 500 Stock Price')
plt.xlabel('Stock Price (USD)')
plt.ylabel('Count')
plt.show()

The stock value of the S&P500 is skewed to the right.

In [613]:
sns.histplot(stocks_df['^MXX'])
plt.title('Distribution of IPC Stock Price')
plt.xlabel('Stock Price (MXN)')
plt.ylabel('Count')
plt.show()

The stock value of the IPC exhibits a normal distribution.

In [614]:
sns.histplot(sales_df['Units'])
plt.title('Distribution of Walmex Units')
plt.xlabel('Units')
plt.ylabel('Count')
plt.show()

The units data of WALMEX seems to show a uniform distribution.

In [615]:
sns.histplot(sales_df['Sq.mt. Mexico'])
plt.title('Distribution of Walmex Commercial Area in Mexico')
plt.xlabel(r'Area ($m^{2}$)')
plt.ylabel('Count')
plt.show()

The commercial area data for Mexico barely resembles a normal distribution.

In [616]:
sns.histplot(sales_df['Sq.mt. Central America'])
plt.title('Distribution of Walmex Commercial Area in Central America')
plt.xlabel(r'Area ($m^{2}$)')
plt.ylabel('Count')
plt.show()

The commercial area data for Central America is skewed to the left.

The distribution of the GDP dataset is shown below:

In [617]:
# Exploring distribution with histogram
sns.histplot(gdp_df[['OBS_VALUE']].astype(float))
plt.title('Distribution of Historic Values of GDP in Mexico')
plt.xlabel('GDP (million of MXN)')
plt.ylabel('Count')
plt.legend('')
plt.show()

The distribution of the GDP is skewed to the right.

The distribution of the exchange rates dataset is shown below:

In [618]:
# Exploring distribution with histogram
sns.histplot(exchange_rates_df.iloc[11:,1].astype(float).to_frame())
plt.title('Distribution of Historic Exchange Rates in Mexico')
plt.xlabel('Rate (%)')
plt.ylabel('Count')
plt.legend('')
plt.show()

The exchange rates dataset shows two distributions. The first distribution (at the left of the histogram) is skewed to the right, and the second one (at the center of the histogram) resembles a normal distribution.

The distribution of the bonds rates dataset is shown below:

In [619]:
# Exploring distribution with histogram
sns.histplot(bonds_rates_df.iloc[11:,1].astype(float).to_frame())
plt.title('Distribution of Historic Bond Rates in Mexico')
plt.xlabel('Rate (%)')
plt.ylabel('Count')
plt.legend('')
plt.show()

No clear distribution shape is noticeable in the bond rates dataset.

The distribution of the interest rates dataset is shown below:

In [620]:
# Exploring distribution with histogram
sns.histplot(interest_rates_df.iloc[11:,1].astype(float).to_frame())
plt.title('Distribution of Historic Interest Rates in Mexico')
plt.xlabel('Rate (%)')
plt.ylabel('Count')
plt.legend('')
plt.show()

Likewise, no clear distribution shape is noticeable in the money market interest rates dataset.

Correlations¶

The distributions and pairwise relationships of the stocks dataset are shown below by means of a pairplot:

In [621]:
# Exploring distributions and correlations with pairplot
sns.pairplot(stocks_df)
plt.show()

The stock values of WALMEX and the S&P 500 exhibit a strong positive correlation, whereas a weak positive correlation was observed between the stock value of WALMEX and the Mexican IPC.

A very weak positive correlation was seen between the S&P500 and the IPC.
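The relationships read off the pairplot can be cross-checked numerically with a Pearson correlation matrix. Below is a sketch on synthetic data built to mimic one strong and one weaker linear relationship; on the actual data the call would simply be stocks_df.corr():

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the three stock series
rng = np.random.default_rng(1)
walmex = pd.Series(rng.normal(50, 10, 500))
sp500 = 60 * walmex + rng.normal(0, 50, 500)   # strong linear link
ipc = 0.5 * walmex + rng.normal(0, 10, 500)    # weaker linear link
stocks = pd.DataFrame({'walmex': walmex, 'sp500': sp500, 'ipc': ipc})

# Numeric counterpart of the pairplot's scatter panels
corr = stocks.corr(method='pearson')
print(corr.round(2))
```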

The distributions and pairwise relationships of the sales dataset are shown below by means of a pairplot:

In [622]:
# Exploring distributions and correlations with pairplot
sns.pairplot(sales_df.drop(columns=['Year', 'Quarter']))
Out[622]:
<seaborn.axisgrid.PairGrid at 0x2283c8cecd0>

A strong positive correlation was observed among all the variables: net sales, units, and commercial area in both Mexico and Central America.

Then, as both are quarterly-reported data, net sales and GDP were plotted in a scatter plot to explore their relationship.

In [623]:
sns.regplot(x=sales_df["Net Sales (mdp)"], y=gdp_df[['OBS_VALUE']].astype(float).iloc[-40:,:])
plt.title('Correlation Between GDP and Net Sales')
plt.xlabel('Net Sales (million of MXN)')
plt.ylabel('GDP (million of MXN)')
plt.show()

A positive correlation was found between net sales and GDP.

Finally, a new dataframe with the exchange, bond, and interest rates data was built to explore their relationships.

In [624]:
rates_df = exchange_rates_df.copy()
rates_df = rates_df.iloc[11:].reset_index(drop=True).rename(columns={'Unnamed: 1':'exchange_rate'})
rates_df = pd.merge(rates_df, 
                    bonds_rates_df.iloc[11:].rename(columns={'Unnamed: 1':'bonds_rate'}), 
                    how='left', 
                    on='Banco de México')
rates_df = pd.merge(rates_df, 
                    interest_rates_df.iloc[11:].rename(columns={'Unnamed: 1':'interest_rate'}), 
                    how='left', 
                    on='Banco de México')
rates_df = rates_df.rename(columns={'Banco de México':'date'})
rates_df.head(10)
Out[624]:
date exchange_rate bonds_rate interest_rate
0 01/01/2014 13.0652 NaN NaN
1 01/02/2014 13.0652 3.16 3.795
2 01/03/2014 13.0843 NaN 3.795
3 01/04/2014 13.1011 NaN NaN
4 01/05/2014 13.1011 NaN NaN
5 01/06/2014 13.1011 NaN 3.7925
6 01/07/2014 13.0911 NaN 3.7962
7 01/08/2014 13.0905 NaN 3.7994
8 01/09/2014 13.0337 3.17 3.8005
9 01/10/2014 13.0846 NaN 3.7927
In [625]:
sns.pairplot(rates_df.drop(columns=['date']).astype(float))
Out[625]:
<seaborn.axisgrid.PairGrid at 0x22849d68cd0>

As expected, a strong positive correlation was found between the bond rates and the money market interest rates.

On the other hand, no relationship was observed between the exchange rates and the bond or interest rates.


4. Data Preparation¶


After the data was explored, it was wrangled to build an appropriate dataset for modeling, based on the purposes of the present study and the quality of the data.

In this context, the following transformations were performed:

  • Select appropriate data and drop unnecessary attributes.
  • Rename attributes.
  • Alter data types.
  • Handle missing values.
  • Construct derived attributes.
  • Group values to the same level of granularity.
  • Join the different datasets.

4.1 Data Selection ¶

The sales dataset contains quarterly net sales data and the number of units, as well as the commercial area in $m^{2}$ in both Mexico and Central America.

Net sales corresponds to the response variable in the model, so it was selected.

While the information for net sales and number of units is complete for the selected time period from 1 Feb 2014 to 1 Feb 2024, the attributes for commercial area in $m^{2}$ in both Mexico and Central America suffer from 32.5% missing values. So it was decided to select only the number of units from the sales dataset as a predictor in the model.
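The 32.5% figure can be reproduced with isna().mean(). Below is a minimal sketch on a toy frame with 13 missing entries out of 40 rows; on the actual data the check would be sales_df[['Sq.mt. Mexico', 'Sq.mt. Central America']].isna().mean() * 100:

```python
import numpy as np
import pandas as pd

# Toy frame: a complete sales column and an area column missing 13 of 40 values
toy = pd.DataFrame({
    'net_sales': range(40),
    'sqmt_mexico': [np.nan] * 13 + list(range(27)),
})

# Share of missing values per column, in percent
missing_rate = toy.isna().mean() * 100
print(missing_rate.round(1))
```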

In [626]:
sales_df = sales_df.drop(columns=['Sq.mt. Mexico', 'Sq.mt. Central America'])

On the other hand, the stocks dataset contains information regarding the stock values of WALMEX, the S&P 500, and the IPC.

From the initial data analysis, a strong positive relationship was found between the stock value of WALMEX and the S&P 500, whereas the relationship between the stock value of WALMEX and the IPC was weaker.

However, as WALMEX operates mostly within the Mexican market, it was decided to select not only the S&P 500 but also the IPC as predictors.

The GDP dataset contains the attributes 'TIME_PERIOD', 'OBS_VALUE', 'OBS_EXCEPTION', 'OBS_STATUS', 'OBS_SOURCE', 'OBS_NOTE', and 'COBER_GEO', where 'OBS_VALUE' is the quarterly estimate of the GDP of Mexico. Since the initial data analysis found a positive correlation between the net sales of WALMEX and the GDP, 'OBS_VALUE' was selected as a predictor in the model, along with the temporality attribute 'TIME_PERIOD'.

In [627]:
gdp_df = gdp_df[['TIME_PERIOD', 'OBS_VALUE']]

Then, from the exchange rates dataset, no meaningful correlation could be established between the historical exchange rates and the net sales of WALMEX. However, for the sake of completeness, the exchange rates were selected as a predictor in the model.

Lastly, regarding the bond rates and interest rates datasets, it was found that the bond rates and the money market interest rates were heavily correlated. So, to avoid multicollinearity issues within the model, only one of those two variables was selected.

Taking into account that the money market interest rates are more relevant to the proper functioning of businesses in the country in terms of access to capital from the financial system, and that the bond rates dataset exhibited a higher percentage of missing values, it was decided to select only the interest rates as a predictor in the model.
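The multicollinearity concern can be checked numerically before discarding a column. Below is a sketch on synthetic rate series that track each other closely; on the actual data the check would be rates_df['bonds_rate'].astype(float).corr(rates_df['interest_rate'].astype(float)):

```python
import numpy as np
import pandas as pd

# Synthetic rates: bond rates track the money market rate almost one-to-one
rng = np.random.default_rng(7)
interest = pd.Series(np.linspace(3.0, 11.0, 200) + rng.normal(0, 0.2, 200))
bonds = interest + rng.normal(0, 0.1, 200)

# Pearson correlation; a common rule of thumb is to drop one of any
# predictor pair with |r| above roughly 0.9
r = interest.corr(bonds)
print(r > 0.9)  # → True
```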

4.2 Attributes Renaming ¶

For the sales dataset, the attributes were renamed as follows:

In [628]:
sales_df = sales_df.rename(columns={'Year':'year', 
                                    'Quarter':'quarter', 
                                    'Net Sales (mdp)': 'net_sales', 
                                    'Units': 'units'})

For the stock dataset, the attributes were renamed as follows:

In [629]:
stocks_df = stocks_df.rename(columns={"WALMEX.MX": "walmex", "^MXX": "ipc", "^GSPC": "sp500"})

For the GDP dataset, the attributes were renamed as follows:

In [630]:
gdp_df = gdp_df.rename(columns={'TIME_PERIOD': 'year_quarter', 'OBS_VALUE': 'gdp'})

For the exchange rates dataset, the attributes were renamed as follows:

In [631]:
exchange_rates_df = exchange_rates_df.rename(columns={'Banco de México': 'date', 'Unnamed: 1': 'exchange_rates'})

For the interest rates dataset, the attributes were renamed as follows:

In [632]:
interest_rates_df = interest_rates_df.rename(columns={'Banco de México': 'date', 'Unnamed: 1': 'interest_rates'})

4.3 Data Types Adjustment ¶

In this section, the data types of the different attributes from the datasets were adjusted.

In [633]:
sales_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   year       40 non-null     int64 
 1   quarter    40 non-null     object
 2   net_sales  40 non-null     int64 
 3   units      40 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 1.4+ KB

The attribute year in the sales dataset was transformed to a string data type, as it would later be concatenated with the quarter attribute. Moreover, no mathematical aggregations are applicable to year within the present study.

In [634]:
sales_df.year = sales_df.year.astype(str)
In [635]:
stocks_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2584 entries, 2014-02-03 to 2024-01-31
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   walmex  2515 non-null   float64
 1   sp500   2516 non-null   float64
 2   ipc     2514 non-null   float64
dtypes: float64(3)
memory usage: 80.8 KB

The attributes in the stocks dataset already have the proper data types, so no adjustment was required.

In [636]:
gdp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176 entries, 0 to 175
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year_quarter  176 non-null    object
 1   gdp           176 non-null    object
dtypes: object(2)
memory usage: 2.9+ KB

The attribute gdp in the GDP dataset was transformed to a float data type.

In [637]:
gdp_df.gdp = gdp_df.gdp.astype('float64')
In [638]:
exchange_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3728 entries, 0 to 3727
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   date            3723 non-null   object
 1   exchange_rates  3720 non-null   object
dtypes: object(2)
memory usage: 58.4+ KB

The attributes date and exchange_rates from the exchange rates dataset were transformed to date and float data types, respectively. To do so, all the non-numeric values in the attributes were first replaced by NaN values to avoid casting errors.

In [639]:
exchange_rates_df.date = pd.to_datetime(exchange_rates_df.date, errors='coerce').dt.date
exchange_rates_df.exchange_rates = exchange_rates_df.exchange_rates.replace({'[^0-9.]': np.nan}, regex=True).astype('float64')
In [640]:
interest_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2570 entries, 0 to 2569
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   date            2565 non-null   object
 1   interest_rates  2562 non-null   object
dtypes: object(2)
memory usage: 40.3+ KB

Likewise, the attributes date and interest_rates from the interest rates dataset were transformed to date and float data types, respectively, with the non-numeric values replaced by NaN values.

In [641]:
interest_rates_df.date = pd.to_datetime(interest_rates_df.date, errors='coerce').dt.date
interest_rates_df.interest_rates = interest_rates_df.interest_rates.replace({'[^0-9.]': np.nan}, regex=True).astype('float64')

4.4 Missing Values Handling ¶

In the different datasets, the missing values were dropped, as the data would be aggregated on a quarterly basis in a later step.

In [642]:
sales_df = sales_df.dropna()
stocks_df = stocks_df.dropna()
gdp_df = gdp_df.dropna()
exchange_rates_df = exchange_rates_df.dropna()
interest_rates_df = interest_rates_df.dropna()

4.5 Derived Attributes Construction ¶

The datasets have different reporting periods:

  • Sales dataset: Quarterly.
  • Stocks dataset: Daily.
  • GDP dataset: Quarterly.
  • Exchange rates dataset: Daily.
  • Interest rates dataset: Daily.

In this context, it was decided to add a new derived attribute, year_quarter, to the stocks, exchange rates, and interest rates datasets, in order to match them with the quarterly reported datasets.

In [643]:
# Adding a year_quarter attribute to the stocks dataset
stocks_df['year'] = stocks_df.index.year
stocks_df['year_quarter'] = pd.PeriodIndex(stocks_df.index, freq='Q')
In [644]:
stocks_df.head()
Out[644]:
Ticker walmex sp500 ipc year year_quarter
Date
2014-02-04 31.500000 1755.199951 40085.519531 2014 2014Q1
2014-02-05 31.379999 1751.640015 39880.871094 2014 2014Q1
2014-02-06 31.190001 1773.430054 40288.781250 2014 2014Q1
2014-02-07 29.850000 1797.020020 40525.738281 2014 2014Q1
2014-02-10 29.740000 1799.839966 40116.359375 2014 2014Q1
In [645]:
# Adding a year_quarter attribute to the exchange rates dataset
exchange_rates_df.date = pd.to_datetime(exchange_rates_df.date)
exchange_rates_df = exchange_rates_df.set_index('date')
exchange_rates_df['year'] = exchange_rates_df.index.year
exchange_rates_df['year_quarter'] = pd.PeriodIndex(exchange_rates_df.index, freq='Q')
In [646]:
# Adding a year_quarter attribute to the interest rates dataset
interest_rates_df.date = pd.to_datetime(interest_rates_df.date)
interest_rates_df = interest_rates_df.set_index('date')
interest_rates_df['year'] = interest_rates_df.index.year
interest_rates_df['year_quarter'] = pd.PeriodIndex(interest_rates_df.index, freq='Q')

Moreover, the nomenclature of the year_quarter attribute in the GDP dataset was harmonized with the format produced by pandas' PeriodIndex.

In [647]:
gdp_df.head()
Out[647]:
year_quarter gdp
0 1980/01 1.040137e+07
1 1980/02 1.034235e+07
2 1980/03 1.039273e+07
3 1980/04 1.092767e+07
4 1981/01 1.134585e+07
In [648]:
gdp_df.year_quarter = gdp_df.year_quarter.apply(lambda x: x.replace('/0', 'Q'))
In [649]:
gdp_df.head()
Out[649]:
year_quarter gdp
0 1980Q1 1.040137e+07
1 1980Q2 1.034235e+07
2 1980Q3 1.039273e+07
3 1980Q4 1.092767e+07
4 1981Q1 1.134585e+07

Finally, a new attribute year_quarter was introduced in the sales dataset:

In [650]:
sales_df['year_quarter'] = sales_df.year + sales_df.quarter
In [651]:
sales_df.head()
Out[651]:
year quarter net_sales units year_quarter
0 2014 Q1 101405 2867 2014Q1
1 2014 Q2 103300 2879 2014Q2
2 2014 Q3 104367 2904 2014Q3
3 2014 Q4 128586 2983 2014Q4
4 2015 Q1 110875 2987 2015Q1

4.6 Data Aggregation ¶

The sales and GDP datasets are reported on a quarterly basis. So, it was decided to aggregate the stocks, exchange rates, and interest rates datasets on a quarterly basis to harmonize the reporting period across all datasets before joining them in the next step.

The mean was used as the aggregation function for all three datasets.

In [652]:
# Aggregating stocks dataset on a quarterly basis
stocks_df = (stocks_df.groupby(by='year_quarter', as_index=False).mean()
             .drop(columns='year').reset_index(drop=True))
stocks_df.head()
Out[652]:
Ticker year_quarter walmex sp500 ipc
0 2014Q1 29.680789 1843.603426 39528.067331
1 2014Q2 33.320492 1901.228686 41669.394467
2 2014Q3 34.399206 1975.541901 44788.734003
3 2014Q4 30.782623 2007.633117 43557.017354
4 2015Q1 33.806780 2064.119323 42891.938957
In [653]:
# Aggregating exchange rates dataset on a quarterly basis
exchange_rates_df = (exchange_rates_df.groupby(by='year_quarter', as_index=False).mean()
                    .drop(columns='year').reset_index(drop=True))
exchange_rates_df.head()
Out[653]:
year_quarter exchange_rates
0 2014Q1 13.233682
1 2014Q2 13.002503
2 2014Q3 13.110359
3 2014Q4 13.841998
4 2015Q1 14.930269
In [654]:
# Aggregating interest rates dataset on a quarterly basis
interest_rates_df = (interest_rates_df.groupby(by='year_quarter', as_index=False).mean()
                     .drop(columns='year').reset_index(drop=True))
interest_rates_df.head()
Out[654]:
year_quarter interest_rates
0 2014Q1 3.789890
1 2014Q2 3.677145
2 2014Q3 3.295198
3 2014Q4 3.293132
4 2015Q1 3.300654

4.7 Data Integration ¶

Finally, all the datasets were joined into a single dataset using the year_quarter attribute.

In [655]:
# Adjusting Datatypes on primary keys
stocks_df.year_quarter = stocks_df.year_quarter.astype(str)
exchange_rates_df.year_quarter = exchange_rates_df.year_quarter.astype(str)
interest_rates_df.year_quarter = interest_rates_df.year_quarter.astype(str)
In [656]:
# Merging datasets
df = (sales_df.merge(gdp_df, how='inner', on='year_quarter')
      .merge(stocks_df, how='inner', on='year_quarter')
      .merge(exchange_rates_df, how='inner', on='year_quarter')
      .merge(interest_rates_df, how='inner', on='year_quarter')
      )
df.head()
Out[656]:
year quarter net_sales units year_quarter gdp walmex sp500 ipc exchange_rates interest_rates
0 2014 Q1 101405 2867 2014Q1 2.154765e+07 29.680789 1843.603426 39528.067331 13.233682 3.789890
1 2014 Q2 103300 2879 2014Q2 2.232009e+07 33.320492 1901.228686 41669.394467 13.002503 3.677145
2 2014 Q3 104367 2904 2014Q3 2.225821e+07 34.399206 1975.541901 44788.734003 13.110359 3.295198
3 2014 Q4 128586 2983 2014Q4 2.293982e+07 30.782623 2007.633117 43557.017354 13.841998 3.293132
4 2015 Q1 110875 2987 2015Q1 2.217008e+07 33.806780 2064.119323 42891.938957 14.930269 3.300654
In [657]:
# Drop year and quarter columns
df = df.drop(columns=['year', 'quarter']).set_index('year_quarter')
df.head()
Out[657]:
net_sales units gdp walmex sp500 ipc exchange_rates interest_rates
year_quarter
2014Q1 101405 2867 2.154765e+07 29.680789 1843.603426 39528.067331 13.233682 3.789890
2014Q2 103300 2879 2.232009e+07 33.320492 1901.228686 41669.394467 13.002503 3.677145
2014Q3 104367 2904 2.225821e+07 34.399206 1975.541901 44788.734003 13.110359 3.295198
2014Q4 128586 2983 2.293982e+07 30.782623 2007.633117 43557.017354 13.841998 3.293132
2015Q1 110875 2987 2.217008e+07 33.806780 2064.119323 42891.938957 14.930269 3.300654
In [658]:
df.tail()
Out[658]:
net_sales units gdp walmex sp500 ipc exchange_rates interest_rates
year_quarter
2022Q4 236272 3745 2.498115e+07 73.022500 3849.569010 49374.139648 19.699096 10.029444
2023Q1 204601 3755 2.429196e+07 72.581333 3999.022510 53196.447331 18.704144 11.071640
2023Q2 212164 3775 2.500194e+07 70.336667 4208.393827 54292.645443 17.723184 11.521205
2023Q3 211436 3802 2.512127e+07 67.776825 4458.137447 53210.645523 17.058150 11.501251
2023Q4 251921 3903 2.559636e+07 66.062000 4463.005339 52375.833008 17.582583 11.504218

Finally, the index of the integrated dataset was converted to a date data type to ease the application of the time series modeling below.

In [659]:
df.index = pd.to_datetime(df.index)
df = df.rename_axis('date')

Now, the dataset is ready for the Exploratory Data Analysis.

In [660]:
# Exporting processed dataset
df.to_csv('dataset_processed.csv', index=True)

4.8 Predictors and Response Split ¶

The dataset for the time series analysis was split into predictors ($X$) and the response variable ($y$) to ease the subsequent statistical tests before modeling:

In [661]:
X = df[df.columns.difference(['net_sales'], sort=False)]
y = df['net_sales']
In [662]:
X.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 40 entries, 2014-01-01 to 2023-10-01
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   units           40 non-null     int64  
 1   gdp             40 non-null     float64
 2   walmex          40 non-null     float64
 3   sp500           40 non-null     float64
 4   ipc             40 non-null     float64
 5   exchange_rates  40 non-null     float64
 6   interest_rates  40 non-null     float64
dtypes: float64(6), int64(1)
memory usage: 2.5 KB
In [663]:
X_vals = X.values
y_vals = y.values


5. Exploratory Data Analysis¶


In this section, the unified dataset was explored to calculate summary statistics, identify time trends, and examine distributions and relationships, as well as to assess stationarity, autocorrelation, decomposition, and cointegration among the attributes.

5.1 Statistical Measures ¶

In [664]:
df.describe()
Out[664]:
net_sales units gdp walmex sp500 ipc exchange_rates interest_rates
count 40.000000 40.000000 4.000000e+01 40.000000 40.000000 40.000000 40.000000 40.000000
mean 160831.950000 3311.700000 2.349702e+07 52.416249 3006.016493 46309.976007 18.490338 6.431333
std 37308.578943 300.161512 1.136271e+06 13.595784 905.561387 4472.673823 2.382935 2.550068
min 101405.000000 2867.000000 1.920180e+07 29.680789 1843.603426 36401.019723 13.002503 3.293132
25% 130929.000000 3041.500000 2.299367e+07 42.609978 2147.388925 43350.607976 17.688033 4.242145
50% 160334.500000 3254.000000 2.347919e+07 52.997484 2793.243389 46092.678673 19.051471 6.221824
75% 188321.250000 3545.750000 2.422661e+07 64.671428 3894.647899 49392.656242 20.017630 8.143081
max 251921.000000 3903.000000 2.559636e+07 74.324166 4600.347089 54292.645443 23.365513 11.521205

5.2 Time Patterns ¶

A high-level view of the time series for all the features is shown below:

In [665]:
# List of indexes
col_list = range(len(df.columns))

# Adjusting figure size
plt.figure(figsize=(8,7))

# Loop
i = 1
for col in col_list:   

    # Plot
    ax = plt.subplot(len(col_list), 1, i)        
    plt.plot(df.iloc[:,col])    
    plt.title(df.columns[col], y=0.5, loc='left')    
    plt.grid(False)

    # Removing labels and ticks
    plt.ylabel('')
    plt.xlabel('')
    #plt.yticks([])
    plt.xticks([])

    i += 1

# Adjusting ticks in plot
# max_xticks = 15
# xloc = plt.MaxNLocator(max_xticks)
# ax.xaxis.set_major_locator(xloc)
# plt.xticks(rotation=45)

# Adding a x label
plt.xlabel('Time')
plt.savefig(f'Images/fig_{"variables over time".lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: Variables over Time]

Individual line charts were also plotted to assess the time trend and seasonality of each feature in more detail:

In [666]:
plot_linechart(df)
[Figures: individual line charts for each variable in the dataset]

From the plots above, it can be seen that all the variables exhibited a positive trend over time.

5.3 Distributions and Relationships ¶

In [667]:
sns.pairplot(df)
plt.savefig(f'Images/fig_{"variables distributions and relationships".lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
[Figure: Variables Distributions and Relationships (pair plot)]

From the plot above, none of the predictors followed a normal distribution.

Moreover, it can be seen that the variables units, S&P 500, WALMEX stock, and net sales show a strong positive relationship, whereas GDP and interest rates show only a moderate positive relationship.

5.4 Correlations ¶

In [668]:
df.corr()
Out[668]:
net_sales units gdp walmex sp500 ipc exchange_rates interest_rates
net_sales 1.000000 0.941812 0.591655 0.881404 0.876755 0.497426 0.545840 0.706790
units 0.941812 1.000000 0.471083 0.951423 0.950654 0.481692 0.539032 0.665831
gdp 0.591655 0.471083 1.000000 0.458073 0.435537 0.668256 0.103462 0.690051
walmex 0.881404 0.951423 0.458073 1.000000 0.949792 0.539263 0.613590 0.597886
sp500 0.876755 0.950654 0.435537 0.949792 1.000000 0.575764 0.524288 0.536097
ipc 0.497426 0.481692 0.668256 0.539263 0.575764 1.000000 0.070975 0.505446
exchange_rates 0.545840 0.539032 0.103462 0.613590 0.524288 0.070975 1.000000 0.317116
interest_rates 0.706790 0.665831 0.690051 0.597886 0.536097 0.505446 0.317116 1.000000
In [669]:
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(df.corr(), cmap=cmap, center=0.5, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.title('Correlation Matrix')
plt.xlabel('Attributes')
plt.ylabel('Attributes')
plt.savefig(f'Images/fig_{"correlation matrix".lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: Correlation Matrix]

Indeed, from the table above, it can be seen that the correlation among net sales, units, WALMEX stock, and S&P500 is strong; whereas GDP, interest rates and IPC are moderately correlated.

In contrast, exchange rates, GDP and IPC are weakly correlated.

5.5 Differencing ¶

As strong correlations were found among the attributes in the dataset, the differences between consecutive observations were calculated for each variable.

In [670]:
df.diff()
Out[670]:
net_sales units gdp walmex sp500 ipc exchange_rates interest_rates
date
2014-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
2014-04-01 1895.0 12.0 772439.473 3.639703 57.625260 2141.327136 -0.231179 -0.112745
2014-07-01 1067.0 25.0 -61883.856 1.078714 74.313215 3119.339536 0.107855 -0.381947
2014-10-01 24219.0 79.0 681607.627 -3.616583 32.091216 -1231.716649 0.731639 -0.002067
2015-01-01 -17711.0 4.0 -769739.789 3.024157 56.486206 -665.078397 1.088271 0.007522
2015-04-01 2675.0 6.0 673107.361 4.392073 38.175439 2099.669970 0.379645 0.000254
2015-07-01 2908.0 11.0 168436.267 1.374798 -74.588570 -1127.769393 1.091249 0.016643
2015-10-01 28523.0 61.0 436115.112 4.159252 23.757037 16.158072 0.348193 0.029321
2016-01-01 -19467.0 -9.0 -911954.852 -0.622731 -103.225304 -638.493498 1.266760 0.443094
2016-04-01 1094.0 15.0 823087.967 -0.458297 126.893481 2230.406048 0.035913 0.290438
2016-07-01 -733.0 -94.0 -144280.433 0.836220 87.288906 1679.088108 0.666134 0.506849
2016-10-01 30640.0 45.0 770137.090 -4.150428 22.484175 -668.554319 1.108795 0.842175
2017-01-01 -24805.0 6.0 -674041.839 -0.635500 140.787174 856.216927 0.562376 0.966881
2017-04-01 3082.0 18.0 403515.674 4.053571 73.568811 1904.803587 -1.790742 0.633231
2017-07-01 11843.0 159.0 -222953.221 -0.271452 67.914152 1606.913207 -0.776203 0.343346
2017-10-01 20480.0 -71.0 829054.680 1.899047 135.831554 -2131.552536 1.111891 0.045699
2018-01-01 -23162.0 22.0 -768826.439 1.743736 133.879166 50.576675 -0.170510 0.326388
2018-04-01 1523.0 26.0 830388.254 4.231661 -32.986940 -1909.600156 0.608665 0.127860
2018-07-01 1159.0 23.0 -243029.643 3.691905 145.703024 2302.189422 -0.394126 0.235060
2018-10-01 29488.0 44.0 495679.257 -2.106301 -150.142597 -5061.005004 0.850446 0.140761
2019-01-01 -25589.0 10.0 -820271.023 -1.607554 19.582852 -1046.752090 -0.607683 0.310076
2019-04-01 2615.0 32.0 398092.413 4.108466 162.235186 750.026780 -0.096438 -0.054586
2019-07-01 11005.0 161.0 -158157.609 1.407294 75.922653 -2129.911552 0.292759 -0.150171
2019-10-01 30570.0 37.0 328696.864 0.041976 123.538524 1712.626726 -0.135783 -0.491249
2020-01-01 -24767.0 -73.0 -938115.399 0.184734 -16.924056 -1333.406271 0.597199 -0.520779
2020-04-01 -2082.0 12.0 -4242953.706 0.196816 -128.147437 -5652.549938 3.486704 -1.294603
2020-07-01 -3721.0 24.0 2828564.567 -1.915431 383.125359 887.166351 -1.254566 -1.066331
2020-10-01 30570.0 37.0 1436766.856 -0.226280 238.480987 3408.909188 -1.479114 -0.475092
2021-01-01 -25390.0 24.0 -731247.989 8.026662 308.241010 5005.818138 -0.311898 -0.137137
2021-04-01 4044.0 25.0 662532.377 2.046887 320.753682 3745.292623 -0.270193 -0.072722
2021-07-01 1199.0 31.0 -236375.579 4.341144 234.492389 1653.000760 -0.040312 0.339241
2021-10-01 38486.0 51.0 744441.461 4.439178 179.587107 436.072010 0.736949 0.554809
2022-01-01 -26654.0 11.0 -450075.199 1.335941 -136.780602 1562.314241 -0.222159 0.842042
2022-04-01 7649.0 20.0 720026.474 -1.427117 -362.603037 -1997.190050 -0.481987 1.039025
2022-07-01 2263.0 26.0 92985.832 -1.422605 -118.913592 -4065.328641 0.200000 1.404996
2022-10-01 39951.0 68.0 711724.075 1.548056 -132.480849 2337.065306 -0.543136 1.553271
2023-01-01 -31671.0 10.0 -689190.328 -0.441167 149.453499 3822.307682 -0.994951 1.042196
2023-04-01 7563.0 20.0 709983.829 -2.244667 209.371318 1096.198112 -0.980961 0.449565
2023-07-01 -728.0 27.0 119329.291 -2.559842 249.743620 -1081.999919 -0.665034 -0.019954
2023-10-01 40485.0 101.0 475091.294 -1.714826 4.867891 -834.812516 0.524433 0.002967
In [671]:
sns.pairplot(df.diff())
Out[671]:
<seaborn.axisgrid.PairGrid at 0x22840702100>
[Figure: pair plot of the first-order differences]

The differences did not exhibit any clear relationships among them, and their distributions were closer to normal. To confirm these observations, Pearson's correlation coefficients were calculated for the differences.

In [672]:
df.diff().corr()
Out[672]:
net_sales units gdp walmex sp500 ipc exchange_rates interest_rates
net_sales 1.000000 0.438193 0.471910 -0.143753 -0.086474 -0.143666 0.057721 -0.013358
units 0.438193 1.000000 0.138183 -0.094973 -0.006673 -0.018928 -0.177507 0.028760
gdp 0.471910 0.138183 1.000000 -0.087999 0.281858 0.279658 -0.543744 0.105603
walmex -0.143753 -0.094973 -0.087999 1.000000 0.224537 0.394898 -0.035092 -0.092308
sp500 -0.086474 -0.006673 0.281858 0.224537 1.000000 0.572800 -0.399108 -0.347191
ipc -0.143666 -0.018928 0.279658 0.394898 0.572800 1.000000 -0.615247 0.090324
exchange_rates 0.057721 -0.177507 -0.543744 -0.035092 -0.399108 -0.615247 1.000000 -0.236574
interest_rates -0.013358 0.028760 0.105603 -0.092308 -0.347191 0.090324 -0.236574 1.000000
In [673]:
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(df.diff().corr(), cmap=cmap, center=0, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.title('Correlation Matrix')
plt.xlabel('Attributes')
plt.ylabel('Attributes')
Out[673]:
Text(47.10937499999999, 0.5, 'Attributes')
[Figure: Correlation Matrix of the first-order differences]

Indeed, according to the table above, only weak or moderate correlations were found among the differences.

In this context, it was decided not to include the differences in the dataset for the regression models.

In [674]:
# Concatenating the differences attributes to the dataset
# df_regression = df.copy()
# df_regression = pd.concat([df_regression, df_regression.diff().add_suffix('_diff')], axis=1).dropna()
In [675]:
# Splitting the predictors and response variable for the dataset including differences
# X_reg = df_regression[df_regression.columns.difference(['net_sales', 'net_sales_diff'], sort=False)]
# y_reg = df_regression['net_sales']

5.6 Stationarity ¶

Stationarity of the Net Sales¶

The stationarity of the net sales time series was assessed by testing for unit roots with the Augmented Dickey-Fuller (ADF) test (Peixeiro, 2022), using a 95% confidence level:

$$H_{0}: Non\text{-}stationary$$$$H_{1}: Stationary$$$$\alpha = 0.05$$
In [676]:
stationarity_test = adfuller(y_vals, autolag='AIC')
print("ADF statistic: ", stationarity_test[0])
print("P-value: ", stationarity_test[1])
ADF statistic:  1.3627370980379012
P-value:  0.9969381097148852

As the $p\text{-}value > 0.05$, the null hypothesis cannot be rejected: the time series is not stationary.

In this context, the net sales data was transformed using differencing to stabilize the trend and seasonality (Peixeiro, 2022):

In [677]:
# First order differencing
y_vals_diff = np.diff(y_vals, n=1)
y_diff = pd.DataFrame(y_vals_diff, columns=['Net Sales'], index=y.index[1:])
In [678]:
chart_title = 'First-Order Differenced Net Sales'
plot_linechart(y_diff, is_saved=False)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: First-Order Differenced Net Sales]

Indeed, differencing seems to have stabilized the trend of the net sales. So, the ADF test was applied to the first-order differenced series:

In [679]:
stationarity_test_diff = adfuller(y_vals_diff, autolag='AIC')
print("ADF statistic: ", stationarity_test_diff[0])
print("P-value: ", stationarity_test_diff[1])
ADF statistic:  -1.6511761047174953
P-value:  0.45642712357493526

Even though the ADF statistic is now negative, the $p\text{-}value > 0.05$, so the null hypothesis cannot be rejected: the time series is still not stationary after first-order differencing.

Thus, the net sales were transformed again with a second-order differencing, and the ADF test was applied once more.

In [680]:
# Second-order differencing
y_vals_diff2 = np.diff(y_vals_diff, n=1)
y_diff2 = pd.DataFrame(y_vals_diff2, columns=['Net Sales'], index=y.index[2:])
In [681]:
chart_title = 'Second-Order Differenced Net Sales'
plot_linechart(y_diff2, is_saved=False)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: Second-Order Differenced Net Sales]
In [682]:
stationarity_test_diff2 = adfuller(y_vals_diff2, autolag='AIC')
print("ADF statistic: ", stationarity_test_diff2[0])
print("P-value: ", stationarity_test_diff2[1])
ADF statistic:  -4.941259634510584
P-value:  2.89677222031049e-05

Now, the ADF statistic is strongly negative and the $p\text{-}value < 0.05$, so the null hypothesis is rejected. The second-order differenced time series is stationary.

Stationarity of the Macroeconomic Indicators¶

Moreover, as the VAR model requires all the time series to be stationary as well (Peixeiro, 2022), the stationarity of the other macroeconomic variables was also tested with the Augmented Dickey-Fuller (ADF) test:

$$H_{0}: Non\text{-}stationary$$$$H_{1}: Stationary$$$$\alpha = 0.05$$
In [683]:
def test_stationarity(df: pd.DataFrame, alpha: float) -> None:
        """
        Tests for stationarity by applying the Augmented Dickey-Fuller (ADF) test to each of the columns
        in the input dataset, at the provided significance level.

        Parameters:
        df (pandas.DataFrame): Dataset with the variables to be tested for stationarity.
        alpha (float): Specified significance level.

        Returns:
        None
        """

        for col in df.columns:
                
                stationarity_test = adfuller(df[col], autolag='AIC')
                print(f'{col}:')
                print(f"ADF statistic: {stationarity_test[0]:.03f}")
                print(f"P-value: {stationarity_test[1]:.03f}")

                if stationarity_test[1] <= alpha:
                    print("The series is stationary.\n")
                else:
                    print("The series is not stationary.\n")
In [684]:
test_stationarity(X, 0.05)
units:
ADF statistic: 0.403
P-value: 0.982
The series is not stationary.

gdp:
ADF statistic: -3.197
P-value: 0.020
The series is stationary.

walmex:
ADF statistic: -1.143
P-value: 0.697
The series is not stationary.

sp500:
ADF statistic: -0.191
P-value: 0.940
The series is not stationary.

ipc:
ADF statistic: -2.044
P-value: 0.268
The series is not stationary.

exchange_rates:
ADF statistic: -2.308
P-value: 0.169
The series is not stationary.

interest_rates:
ADF statistic: -3.185
P-value: 0.021
The series is stationary.

Thus, only the gdp and interest_rates series are stationary, while the others are not. For consistency, a first-order differencing transformation was applied to all the series, which also simplifies the inverse transformation when making predictions with the VAR model.

In [685]:
# First-order differencing

X_diff = X.diff().dropna()

test_stationarity(X_diff, 0.05)
units:
ADF statistic: -5.445
P-value: 0.000
The series is stationary.

gdp:
ADF statistic: -6.037
P-value: 0.000
The series is stationary.

walmex:
ADF statistic: -4.932
P-value: 0.000
The series is stationary.

sp500:
ADF statistic: -4.322
P-value: 0.000
The series is stationary.

ipc:
ADF statistic: -5.091
P-value: 0.000
The series is stationary.

exchange_rates:
ADF statistic: -4.562
P-value: 0.000
The series is stationary.

interest_rates:
ADF statistic: -3.253
P-value: 0.017
The series is stationary.

Thus, first-order differencing was enough to render all the macroeconomic indicator series stationary.
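Since models trained on differenced series produce forecasts on the differenced scale, those forecasts must later be converted back to the original scale. A minimal sketch of this inverse transformation (using a small hypothetical series, not the WALMEX data): the original series is its first value plus the cumulative sum of the differences.

```python
import numpy as np

def invert_first_diff(first_value: float, diffs: np.ndarray) -> np.ndarray:
    """Recover a series from its first value and its first-order
    differences: y_t = y_0 + cumulative sum of the first t differences."""
    return np.concatenate(([first_value], first_value + np.cumsum(diffs)))

y = np.array([100.0, 103.0, 101.0, 107.0])   # toy series
diffs = np.diff(y, n=1)                      # first-order differences: [3., -2., 6.]
recovered = invert_first_diff(y[0], diffs)   # identical to y
```

The same idea applies to second-order differencing, inverting twice with the first two values of the series.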

5.7 Autocorrelation Function ¶

The autocorrelation of the net sales time series was assessed by means of the Autocorrelation Function (ACF) plot, to identify the significant coefficients/lags in the series.

In [686]:
chart_title = 'Net Sales Autocorrelation'
plot_acf(y_vals, lags=20)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: Net Sales Autocorrelation]

Thus, according to the ACF plot, the lags 2, 3, 4, and 5 are significant, which means that the time series is actually dependent on its past values.

In [687]:
chart_title = 'First-Order Differenced Net Sales Autocorrelation'
plot_acf(y_vals_diff, lags=20)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: First-Order Differenced Net Sales Autocorrelation]

In this case, as there are no consecutive significant lags after lag 2, it is possible to state that the first-order differenced net sales is autocorrelated only at lags 1 and 2.

In [688]:
chart_title = 'Second-Order Differenced Net Sales Autocorrelation'
plot_acf(y_vals_diff2, lags=20)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: Second-Order Differenced Net Sales Autocorrelation]

Likewise, as there are no consecutive significant lags after lag 2, it is possible to state that the second-order differenced net sales are also autocorrelated only at lags 1 and 2. However, it is clear that the overall trend of the ACF follows a sinusoidal-like pattern. This is evidence that the time series is autoregressive (Peixeiro, 2022).

An important conclusion so far: since the ADF test showed that the first-order differenced net sales are not stationary, and the ACF plot of the first-order differenced series showed autocorrelation at lag 2, the net sales time series is not a random walk. This means that statistical learning techniques can be successfully applied to forecast the data (Peixeiro, 2022).

5.8 Decomposition ¶

As part of the exploratory data analysis, the WALMEX net sales series was decomposed using the seasonal_decompose method from the statsmodels library.
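As a reminder, the two decomposition schemes model the observed series $y_{t}$ in terms of a trend $T_{t}$, a seasonal component $S_{t}$, and a residual $R_{t}$:

$$Additive: y_{t} = T_{t} + S_{t} + R_{t}$$$$Multiplicative: y_{t} = T_{t} \times S_{t} \times R_{t}$$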

Firstly, an additive decomposition was performed:

In [689]:
chart_title = 'Additive Decomposition for WALMEX Net Sales'
decomposition = seasonal_decompose(y, model='additive')
decomposition.plot()
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: Additive Decomposition for WALMEX Net Sales]

From the plot above, it is clear that the WALMEX net sales fit an additive model very well, as the trend is a straight line with a positive slope, and the seasonal component could be isolated cleanly from the series, exhibiting both a regular period (a cycle of 4 quarters) and a regular amplitude.

Then, a multiplicative decomposition was performed as well on the series:

In [690]:
chart_title = 'Multiplicative Decomposition for WALMEX Net Sales'
decomposition = seasonal_decompose(y, model='multiplicative')
decomposition.plot()
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
[Figure: Multiplicative Decomposition for WALMEX Net Sales]

As the residuals of the multiplicative decomposition show a uniform value, the decomposition failed to properly capture all the information in the series. So, this decomposition approach is not appropriate.

5.9 Cointegration Test ¶

To check whether the variables shared a common stochastic trend, or more formally, whether there is a linear combination of these variables that is stable (stationary) (Kotzé, 2024); Johansen's Cointegration test was performed (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

The Johansen trace test checks the cointegration rank $r$ of a given set of series through a sequence of null hypotheses: $r=0$ implies no cointegration relationship, $r\leq1$ at most one cointegration relationship, $r\leq2$ at most two, and so on (Wang, 2018):

$$H_{0}: Non\text{-}cointegration$$$$H_{1}: Cointegration$$$$\alpha = 0.05$$

The test was performed using the most restrictive deterministic term order that assumes a linear trend (det_order=1), and one lag (k_ar_diff=1):

In [691]:
def test_cointegration(time_series: pd.DataFrame, det_order: int = 1, k_ar_diff: int = 1) -> pd.DataFrame:
        """
        Performs the Johansen's Cointegration test and returns the results.

        Parameters:

        time_series (pandas.DataFrame): Time series data for the cointegration test
        det_order (int): The order of the deterministic terms:
                            -1: No constant or trend.
                            0: Constant term only.
                            1: Constant and trend terms.
        k_ar_diff (int): The number of lags.

        Returns:
        
        results_table (pandas.DataFrame): Results from the Johansen's Cointegration test.

        """

        coint_test_result = coint_johansen(endog=time_series, det_order=det_order, k_ar_diff=k_ar_diff)

        trace_stats = coint_test_result.trace_stat
        trace_stats_crit_vals = coint_test_result.trace_stat_crit_vals

        rank_list_int = list(range(0, time_series.shape[1]))
        rank_list_str = [str(x) for x in rank_list_int]
        rank_list_prefix = ['r = ']
        rank_list_prefix = rank_list_prefix + (['r <= ',] * (len(rank_list_str) - 1))

        rank_list = []

        for i in range(len(rank_list_int)):
                rank_list.append(rank_list_prefix[i] + rank_list_str[i])

        result_dict = {'Rank':rank_list, 
                        'CL 90%':trace_stats_crit_vals[:,0], 
                        'CL 95%':trace_stats_crit_vals[:,1], 
                        'CL 99%':trace_stats_crit_vals[:,2], 
                        'Trace Statistic': trace_stats}
                       
        results_table = pd.DataFrame(result_dict)  

        return results_table
In [692]:
results = test_cointegration(time_series=X, det_order=1, k_ar_diff=1)
results
Out[692]:
Rank CL 90% CL 95% CL 99% Trace Statistic
0 r = 0 133.7852 139.2780 150.0778 195.598921
1 r <= 1 102.4674 107.3429 116.9829 135.936953
2 r <= 2 75.1027 79.3422 87.7748 86.363668
3 r <= 3 51.6492 55.2459 62.5202 49.849412
4 r <= 4 32.0645 35.0116 41.0815 27.757403
5 r <= 5 16.1619 18.3985 23.1485 13.290473
6 r <= 6 2.7055 3.8415 6.6349 4.501931

Thus, the null hypothesis of $r=0$ can be easily rejected, as its trace statistic is above the critical values at the 90%, 95%, and 99% confidence levels. However, the test results above suggested up to two cointegration relationships at the 95% confidence level. So, it is not possible to state that all the variables are cointegrated.

To locate these cointegration relationships, recall from the relationships and correlations sections that the units, S&P 500, and WALMEX series were strongly correlated. So, the cointegration test was repeated using only those series.

In [693]:
results = test_cointegration(time_series=X[['units', 'sp500', 'walmex']], det_order=1, k_ar_diff=1)
results
Out[693]:
Rank CL 90% CL 95% CL 99% Trace Statistic
0 r = 0 32.0645 35.0116 41.0815 31.360790
1 r <= 1 16.1619 18.3985 23.1485 14.621989
2 r <= 2 2.7055 3.8415 6.6349 4.138262

From the results above, the null hypothesis of $r\leq2$ can be rejected, as its trace statistic is above the critical value at the 95% confidence level.

Therefore, the Vector models (VAR, VARMA, VARIMA) were later built with the above-mentioned series.


6. Data Modeling¶


Firstly, ten univariate time series models were built to predict the net sales of WALMEX:

  • Moving Average (MA) model,
  • Autoregressive (AR) model,
  • a series of Autoregressive (AR) models with Additive Decomposition,
  • Autoregressive Moving Average (ARMA) model,
  • Autoregressive Integrated Moving Average (ARIMA) model,
  • Seasonal Autoregressive Integrated Moving Average (SARIMA) model,
  • Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX) Model,
  • Simple Exponential Smoothing (SES) model,
  • Holt-Winters (HW) model, and
  • Prophet Univariate Time Series model.

Then, three vector models were created and trained in Python 3 and its libraries Statsmodels and Darts to predict the values of the selected macroeconomic indicators as a multivariate time series:

  • Vector Autoregressive (VAR) model,
  • Vector Autoregressive Moving Average (VARMA) model, and
  • Vector Autoregressive Integrated Moving Average (VARIMA) model

After that, two regression models were built using Random Forests and Support Vector Machines in Python 3 and its library Scikit-learn to predict WALMEX net sales based on the predictions from the best performing multivariate time series model.

All models were assessed using the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), and the Coefficient of Determination ($r^{2}$).
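For reference, for $n$ test observations with actual values $y_{i}$, predictions $\hat{y}_{i}$, and mean actual value $\bar{y}$, these metrics are defined as:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}$$$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|$$$$r^{2} = 1-\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}$$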

For reporting purposes, the model assessment plots were shown in the original scale of the WALMEX net sales data.

Finally, for the sake of clarity, each model is described separately in the present section.

In [694]:
# Function to calculate the RMSE, MAE and Coefficient of Determination
sales_actuals_for_scoring = pd.DataFrame(y[(int(0.8*(len(y)))):], columns=["net_sales"])

def calculate_scores(predictions: pd.DataFrame, actuals: pd.DataFrame = sales_actuals_for_scoring) -> Tuple[List[float], List[float], List[float]]:
        """
        Calculates the RMSE, the MAE and the Coefficient of Determination for a given set of predictions according to the provided actuals.

        Parameters:
        predictions (pandas.DataFrame): Time series predictions for testing period.
        actuals (pandas.DataFrame): Time series actual values for testing period.

        Returns:
        rmse (List[float]): Root Mean Squared Error of the predictions.
        mae (List[float]): Mean Absolute Error of the predictions.
        coeff_det (List[float]): Coefficient of determination (r^2) of the predictions.
        """

        rmse = []
        mae = []
        coeff_det = []

        if len(actuals) == len(predictions):
                        
                for i in range(0, actuals.shape[1]):

                        print(f'{actuals.columns[i]}')

                        rmse_value = np.sqrt(mean_squared_error(actuals.iloc[:,i].values, predictions.iloc[:,i].values))
                        mae_value = mean_absolute_error(actuals.iloc[:,i].values, predictions.iloc[:,i].values)
                        coeff_det_value = r2_score(actuals.iloc[:,i].values, predictions.iloc[:,i].values)

                        print(f'RMSE: {rmse_value:.3f}')
                        print(f'MAE: {mae_value:.3f}')
                        print(f'Coefficient of Determination: {coeff_det_value:.3f}\n')

                        rmse.append(rmse_value)
                        mae.append(mae_value)
                        coeff_det.append(coeff_det_value)

                return rmse, mae, coeff_det
                

        else:

                print('Number of features is different between the testing set and the predictions set.')
In [695]:
# Function to plot the train, test and predictions for a given model
sales_data_for_training = y[:(int(0.8*(len(y)))) + 1]
sales_data_for_testing = y[(int(0.8*(len(y)))):]

def plot_predictions(y_pred: pd.DataFrame,  model_name: str, y_train: pd.DataFrame = sales_data_for_training, y_test: pd.DataFrame = sales_data_for_testing) -> None:
        """
        Plots a line chart with the training, testing and predictions sets for a given model.

        Parameters:
        y_train (pandas.DataFrame): Training dataset
        y_test (pandas.DataFrame): Testing dataset
        y_pred (pandas.DataFrame): Predictions from the model
        model_name (str): Name of the model for the plot title.

        Returns:
        None
        """

        chart_title = 'Predictions from ' +  model_name + ' vs. WALMEX Historical Net Sales'

        # Plots
        fig, ax = plt.subplots(figsize=(8,5))
        sns.lineplot(y_train, ax=ax, zorder=2)
        sns.lineplot(y_test, ax=ax, color='green', linestyle='dashed', zorder=1)
        sns.lineplot(y_pred, ax=ax, color = pred_color, lw=2, zorder=3)

        # Adding shadow to predictions
        first_pred_x = y_test.index[0]
        last_pred_x = y_test.index[-1]
        ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)

        # Adding legend to plot
        legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
                        Line2D([0], [0], color='green', lw=2,  linestyle='dashed'),
                        Line2D([0], [0], color=pred_color, lw=2)]
        plt.legend(legend_lines, 
                   ['Training', 'Testing', 'Predictions'], 
                   loc='upper left', 
                   facecolor='white',
                   frameon = True)

        # Adding labels
        plt.title(chart_title)
        plt.ylabel('Net Sales (mdp)')
        plt.xlabel('Time')

        # Adjusting Y ticks to Currency format
        ticks = ax.get_yticks()
        new_labels = [f'${int(i):,.0f}' for i in ticks]
        ax.set_yticklabels(new_labels)

        plt.savefig(f'Images/fig_{chart_title.lower().replace(".", "").replace(" ", "_")}.png',  bbox_inches = 'tight')
        plt.show()
In [696]:
# Creating list to store the scores from each univariate time series model
univariate_models_list = ['MA', 'AR', 'AR w/Additive Decomp.', 'ARMA', 'ARIMA', 'SARIMA', 'SARIMAX', 'SES', 'HW', 'Prophet']
rmse_list = []
mae_list = []
r2_list = []

# Creating lists to store the scores from multivariate time series models
multivariate_models_list = ['VAR', 'VARMA', 'VARIMA']
X_rmse_list = []
X_mae_list = []
X_r2_list = []

# Creating lists to store the scores from regression models
regression_models_list = ['RF', 'SVR']
reg_rmse_list = []
reg_mae_list = []
reg_r2_list = []

6.1 Moving Average (MA) Model ¶

Modeling Technique ¶

A Moving Average (MA) model was built as a baseline model. As suggested by the name of this technique, the MA model calculates the moving average of a variable over time using a specified window length (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). This technique has the advantage of removing random fluctuations and smoothing the data.

The MA model is denoted as MA($q$), where $q$ is the number of past error terms that affect the present value of the time series (Peixeiro, 2022).
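In its standard form, an MA($q$) process expresses the present value as the series mean $\mu$ plus the current and the $q$ past error terms:

$$y_{t} = \mu + \epsilon_{t} + \theta_{1}\epsilon_{t-1} + \theta_{2}\epsilon_{t-2} + \dots + \theta_{q}\epsilon_{t-q}$$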

The model was built using the library pandas in Python.

Modeling Assumptions ¶

It was assumed that the current values are linearly dependent on the mean of the series and on the current and past error terms, and that the errors are normally distributed (Peixeiro, 2022).

On the other hand, from the ACF plot in the EDA section, it was assumed that a window of 5 was enough to capture the overall trend of the data, as the ACF plot showed that lag 5 was the last significant coefficient.

Test Design ¶

The dataset was split into a training and a testing set, allocating $80\%$ and $20\%$ of the data, respectively.

In [697]:
time_series = y

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff + 1]
y_test = time_series[cutoff:]

Model Building¶

In [698]:
# Window length
ma_window = 5

# Calculating the moving average
y_pred_ma_model = y.rolling(ma_window).mean()

# Narrowing down the moving average calculations to the y_test time period
y_pred_ma_model = y_pred_ma_model[len(y_train) - 1:]
y_pred_ma_model
Out[698]:
date
2022-01-01    183352.2
2022-04-01    188297.0
2022-07-01    192885.6
2022-10-01    205224.6
2023-01-01    203532.2
2023-04-01    208683.2
2023-07-01    212158.8
2023-10-01    223278.8
Name: net_sales, dtype: float64

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the MA model.

In [699]:
plot_predictions(y_pred_ma_model, 'MA Model')

In view of the plot above, the MA model was able to capture the trend of the time series but not its seasonality.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [700]:
ma_rmse, ma_mae, ma_r2 = calculate_scores(pd.DataFrame(y_pred_ma_model))
net_sales
RMSE: 15216.901
MAE: 9651.900
Coefficient of Determination: 0.465

In [701]:
# Adding scores to lists
rmse_list.append(ma_rmse[0])
mae_list.append(ma_mae[0])
r2_list.append(ma_r2[0])

6.2 Autoregressive (AR) Model ¶

Modeling Technique ¶

As suggested by the ACF plot in the EDA, an Autoregressive (AR) model was built to forecast the net sales of WALMEX.

An autoregressive model implies a regression of a variable against itself, which means that the present value of a given point in a time series is linearly dependent on its past values (Peixeiro, 2022).

An AR model is denoted as AR($p$), where the order $p$ determines the number of past values that affect the present value (Peixeiro, 2022).
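In its general form (Peixeiro, 2022), the AR($p$) process can be expressed as follows, where $C$ is a constant, $\phi_{p}$ is the coefficient of the $p$-th lagged value, and $\epsilon_{t}$ is white noise:

$$y_{t} = C + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \epsilon_{t}$$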

The model was built using the library statsmodels in Python.

Modeling Assumptions ¶

It was assumed that the current values are linearly dependent on their past values (Peixeiro, 2022). Moreover, it was assumed that the time series was stationary, that it was not a random walk, and that seasonality effects were not relevant for modeling.

Test Design ¶

The dataset was split into a training and a testing set, allocating $80\%$ and $20\%$ of the data, respectively.

It is important to note that the second-order differenced net sales time series was used for modeling, as AR models require that the time series be stationary.

In [702]:
time_series = y_diff2

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]

Partial Autocorrelation Function¶

Then, the order for each AR Model was identified by using the partial autocorrelation function (PACF) plot to assess the effect of past data (the so-called lags) on future data (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

In [703]:
# Second-order differenced net sales
chart_title = 'Partial Autocorrelation for Second-Order Differenced Net Sales'
pacf = plot_pacf(y_train, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')

So, the order of the AR model for the second-order differenced net sales is 3.

Model Building¶

In [704]:
ar_model = AutoReg(y_diff2, lags=3).fit()
In [705]:
# Summary for the AR Model
print(ar_model.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:              Net Sales   No. Observations:                   38
Model:                     AutoReg(3)   Log Likelihood                -350.623
Method:               Conditional MLE   S.D. of innovations           5425.504
Date:                Fri, 01 Nov 2024   AIC                            711.246
Time:                        17:46:26   BIC                            719.023
Sample:                    04-01-2015   HQIC                           713.931
                         - 10-01-2023                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const          142.5663    918.068      0.155      0.877   -1656.815    1941.947
Net Sales.L1    -1.0230      0.037    -27.670      0.000      -1.095      -0.951
Net Sales.L2    -1.0104      0.047    -21.417      0.000      -1.103      -0.918
Net Sales.L3    -1.0203      0.038    -27.187      0.000      -1.094      -0.947
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -0.9838           -0.0000j            0.9838           -0.5000
AR.2           -0.0033           -0.9981j            0.9981           -0.2505
AR.3           -0.0033           +0.9981j            0.9981            0.2505
-----------------------------------------------------------------------------

Predictions¶

In [706]:
# Defining starting and ending points
start_point = len(y_diff2) - len(y_test) - 1
end_point = start_point +  len(y_test)

# Getting predictions for each component
y_pred_ar_model = ar_model.predict(start=start_point, end=end_point, dynamic=False)
y_pred_ar_model
Out[706]:
2022-01-01   -65160.762579
2022-04-01    32011.282594
2022-07-01    -7178.561389
2022-10-01    37457.043271
2023-01-01   -67971.549635
2023-04-01    40830.037800
2023-07-01    -6082.944499
2023-10-01    42060.452083
Freq: QS-OCT, dtype: float64

Finally, the differencing transformation was reversed twice in order to bring the calculated predictions back to the scale of the original data.

To do so, the cumulative sum of the predictions was calculated and added to the corresponding last value of the training set in the first-order differenced time series, reversing the transformation from second order to first order.

Then, the cumulative sum of the resulting series was calculated and added to the corresponding last value of the training set in the original time series, reversing the first-order differencing back to the original scale.
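This two-step inversion can be sketched with a toy series (the series `s` below is illustrative, not the WALMEX data): differencing twice and then applying the two anchored cumulative sums recovers the original tail exactly.

```python
import pandas as pd

# Toy series standing in for the net sales (illustrative values only)
s = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0, 30.0])

# Second-order differencing, as used for the AR model
s_diff = s.diff().dropna()
s_diff2 = s_diff.diff().dropna()

# Treating the last three second-order differences as "predictions"
preds = s_diff2.iloc[-3:]

# Step 1: anchor on the last first-order difference before the forecast window
first_order = s_diff.iloc[-4] + preds.cumsum()

# Step 2: anchor on the last original value before the forecast window
original = s.iloc[-4] + first_order.cumsum()

print(original.tolist())  # → [19.0, 24.0, 30.0], i.e. the original tail
```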

In [707]:
# Reverting to first-order differenced net sales
y_pred_ar_model = y_diff.loc['2021-10-01'].values + y_pred_ar_model.cumsum()
y_pred_ar_model
Out[707]:
2022-01-01   -26674.762579
2022-04-01     5336.520015
2022-07-01    -1842.041374
2022-10-01    35615.001897
2023-01-01   -32356.547738
2023-04-01     8473.490062
2023-07-01     2390.545563
2023-10-01    44450.997646
Freq: QS-OCT, dtype: float64
In [708]:
# Reverting to original scale from net sales
y_pred_ar_model = y.loc['2021-10-01'] + y_pred_ar_model.cumsum()
y_pred_ar_model
Out[708]:
2022-01-01    186388.237421
2022-04-01    191724.757436
2022-07-01    189882.716061
2022-10-01    225497.717958
2023-01-01    193141.170220
2023-04-01    201614.660282
2023-07-01    204005.205846
2023-10-01    248456.203492
Freq: QS-OCT, dtype: float64

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the AR model.

In [709]:
plot_predictions(y_pred_ar_model, 'AR Model')

In view of the plot above, the AR model was able to capture both the trend and the seasonality of the time series.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [710]:
ar_rmse, ar_mae, ar_r2 = calculate_scores(pd.DataFrame(y_pred_ar_model))
net_sales
RMSE: 7687.806
MAE: 6558.916
Coefficient of Determination: 0.863

In [711]:
# Adding scores to lists
rmse_list.append(ar_rmse[0])
mae_list.append(ar_mae[0])
r2_list.append(ar_r2[0])

6.3 Autoregressive (AR) Models with Additive Decomposition ¶

Modeling Technique ¶

A set of Autoregressive (AR) Models based on the Additive Decomposition of the WALMEX net sales time series were built.

The Additive Decomposition was deemed an appropriate decomposition technique, as the EDA suggested that the WALMEX net sales behave in a linear fashion, with constant changes over time and a regular seasonality of equal frequency and amplitude (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Then, a set of Autoregressive Models were built for each component of the time series obtained via the decomposition: trend, seasonality, and remainder. The net sales were forecasted by adding up the predictions from each AR model (Jamieson, 2019).

Likewise, the library statsmodels in Python was used to perform the Additive Decomposition and build the AR Models.

Modeling Assumptions ¶

It is assumed that the components in the net sales time series can be added together as their changes over time are regular, the time trend is a straight line, and the seasonality has the same frequency and amplitude (Brownlee, 2020).

Moreover, it is assumed that each component of the time series can be predicted by means of a linear combination of their historical or lag values (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
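Under these assumptions, the additive decomposition expresses the series as the sum of its components, where $T_{t}$ is the trend, $S_{t}$ is the seasonality, and $R_{t}$ is the remainder; the forecast is then obtained by adding up the forecasts of the individual components:

$$y_{t} = T_{t} + S_{t} + R_{t}$$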

Test Design ¶

The dataset was split into a training and a testing set, allocating $80\%$ and $20\%$ of the data, respectively.

In [712]:
time_series = y

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff + 1]
y_test = time_series[cutoff:]

Additive Decomposition¶

As shown in the EDA, the WALMEX net sales adjusted very well to an additive decomposition. So, the training set was decomposed using the additive model with the class seasonal_decompose from the statsmodels library.

In [713]:
chart_title = 'Additive Decomposition for WALMEX Net Sales Training Set'
decomposition = seasonal_decompose(y_train, model='additive')
decomposition.plot()
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()
In [714]:
# Creating variables for each component
trend = decomposition.trend.dropna()
seasonality = decomposition.seasonal.dropna()
remainder = decomposition.resid.dropna()

However, AR models require the time series to be stationary, so the trend was differenced several times until the ADF test indicated that no unit root was present in the data.

In [715]:
trend_diff = trend.diff().dropna()
trend_diff2 = trend_diff.diff().dropna()
trend_diff3 = trend_diff2.diff().dropna()
In [716]:
stationarity_test = adfuller(trend_diff3, autolag='AIC')
print("ADF statistic: ", stationarity_test[0])
print("P-value: ", stationarity_test[1])
ADF statistic:  -5.065106267573506
P-value:  1.64698660737564e-05

As the $p\text{-}value < 0.05$, the null hypothesis is rejected. So, the third-order differenced time series is stationary.

Partial Autocorrelation Function¶

Then, the order for each AR Model was identified by using the partial autocorrelation function (PACF) plot to assess the effect of past data (the so-called lags) on future data (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

In [717]:
# Trend Component 
chart_title = 'Partial Autocorrelation for Trend Component'
trend_pacf = plot_pacf(trend_diff3, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

So, the order of the AR model for the trend component is 1.

In [718]:
# Seasonality Component 
chart_title = 'Partial Autocorrelation for Seasonality Component'
seasonality_pacf = plot_pacf(seasonality, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

So, the order of the AR model for the seasonality component is 3.

In [719]:
# Remainder Component 
chart_title = 'Partial Autocorrelation for Remainder Component'
remainder_pacf = plot_pacf(remainder, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

So, the order of the AR model for the remainder component is 1.

Model Building¶

Then, an AR model was fitted for each component from the additive decomposition (trend, seasonality, and remainder) using the AutoReg class from statsmodels.

In [720]:
trend_ar_model = AutoReg(trend_diff3, lags=1).fit()
seasonality_ar_model = AutoReg(seasonality, lags=3).fit()
remainder_ar_model = AutoReg(remainder, lags=1).fit()
In [721]:
# Summary for the AR Model for Trend component
print(trend_ar_model.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                  trend   No. Observations:                   26
Model:                     AutoReg(1)   Log Likelihood                -210.538
Method:               Conditional MLE   S.D. of innovations           1099.471
Date:                Fri, 01 Nov 2024   AIC                            427.076
Time:                        17:46:44   BIC                            430.733
Sample:                    07-01-2015   HQIC                           428.090
                         - 07-01-2021                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.6672    220.132      0.039      0.969    -422.783     440.117
trend.L1      -0.1201      0.200     -0.600      0.549      -0.512       0.272
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -8.3290           +0.0000j            8.3290            0.5000
-----------------------------------------------------------------------------
In [722]:
# Summary for the AR Model for Seasonality component
print(seasonality_ar_model.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:               seasonal   No. Observations:                   33
Model:                     AutoReg(3)   Log Likelihood                 749.516
Method:               Conditional MLE   S.D. of innovations              0.000
Date:                Fri, 01 Nov 2024   AIC                          -1489.032
Time:                        17:46:44   BIC                          -1482.026
Sample:                    10-01-2014   HQIC                         -1486.790
                         - 01-01-2022                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const        4.832e-12   6.25e-13      7.732      0.000    3.61e-12    6.06e-12
seasonal.L1    -1.0000   6.66e-17   -1.5e+16      0.000      -1.000      -1.000
seasonal.L2    -1.0000   6.69e-17  -1.49e+16      0.000      -1.000      -1.000
seasonal.L3    -1.0000    6.9e-17  -1.45e+16      0.000      -1.000      -1.000
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -1.0000           -0.0000j            1.0000           -0.5000
AR.2            0.0000           -1.0000j            1.0000           -0.2500
AR.3            0.0000           +1.0000j            1.0000            0.2500
-----------------------------------------------------------------------------
In [723]:
# Summary for the AR Model for Remainder component
print(remainder_ar_model.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                  resid   No. Observations:                   29
Model:                     AutoReg(1)   Log Likelihood                -255.912
Method:               Conditional MLE   S.D. of innovations           2254.688
Date:                Fri, 01 Nov 2024   AIC                            517.823
Time:                        17:46:44   BIC                            521.820
Sample:                    10-01-2014   HQIC                           519.045
                         - 07-01-2021                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const       -229.7760    426.350     -0.539      0.590   -1065.407     605.855
resid.L1      -0.1896      0.195     -0.975      0.330      -0.571       0.192
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -5.2739           +0.0000j            5.2739            0.5000
-----------------------------------------------------------------------------

Predictions¶

Predictions from each AR model were obtained for the length of the testing set.

However, as the trend was transformed before modeling, its predictions were brought back to the original scale of the data by adding the corresponding last value of the training set to the cumulative sum of the predictions. This process was performed three times, as the trend series had been differenced three times.

Finally, the net sales forecast was built by recomposing (adding up) each of the predictions from the AR models into a single time series.

In [724]:
# Defining starting and ending points
start_point = len(y) - len(y_test) - 2
end_point = start_point +  len(y_test) - 1

# Getting predictions for each component
trend_predictions = trend_ar_model.predict(start=start_point - 3, end=end_point - 3, dynamic=False)
seasonality_predictions = seasonality_ar_model.predict(start=start_point + 2, end=end_point + 2, dynamic=False)
remainder_predictions= remainder_ar_model.predict(start=start_point, end=end_point, dynamic=False)

# De-transforming trend predictions to original scale
trend_predictions = trend_diff2.loc['2021-07-01'] + trend_predictions.cumsum()
trend_predictions = trend_diff.loc['2021-07-01'] + trend_predictions.cumsum()
trend_predictions = trend.loc['2021-07-01'] + trend_predictions.cumsum()

# Recomposing component predictions into a single time series
y_pred_ar_models = trend_predictions + seasonality_predictions + remainder_predictions
y_pred_ar_models
Out[724]:
2022-01-01    183595.994447
2022-04-01    188927.735192
2022-07-01    196410.599671
2022-10-01    230031.384667
2023-01-01    213189.781651
2023-04-01    221794.739136
2023-07-01    232762.447850
2023-10-01    269864.772322
Freq: QS-OCT, dtype: float64

Model Assessment¶

Likewise, the predictions were plotted against the historical net sales data to visually assess the performance of the AR models with additive decomposition.

In [725]:
plot_predictions(y_pred_ar_models, 'AR Models With Additive Decomposition')

In view of the plot above, the additive decomposition and the AR models were able to provide close predictions for the WALMEX net sales.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [726]:
ar_decomp_rmse, ar_decomp_mae, ar_decomp_r2 = calculate_scores(pd.DataFrame(y_pred_ar_models))
net_sales
RMSE: 11272.204
MAE: 8970.403
Coefficient of Determination: 0.706

In [727]:
# Adding scores to lists
rmse_list.append(ar_decomp_rmse[0])
mae_list.append(ar_decomp_mae[0])
r2_list.append(ar_decomp_r2[0])

6.4 Autoregressive Moving Average (ARMA) Model ¶

Modeling Technique ¶

An Autoregressive Moving Average (ARMA) model based on the WALMEX net sales time series was built.

The ARMA model is a combination of the autoregressive process and the moving average process, and it is denoted as ARMA($p$,$q$), where $p$ is the order of the autoregressive process, and $q$ is the order of the moving average process. The order of $p$ determines the number of past values that affect the present value; whereas the order of $q$ determines the number of past error terms that affect the present value (Peixeiro, 2022).
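Thus, following Peixeiro (2022), the ARMA($p$,$q$) process combines both expressions, where $C$ is a constant, $\phi_{i}$ are the autoregressive coefficients, $\theta_{i}$ are the moving average coefficients, and $\epsilon_{t}$ is the present error term:

$$y_{t} = C + \phi_{1}y_{t-1} + \dots + \phi_{p}y_{t-p} + \epsilon_{t} + \theta_{1}\epsilon_{t-1} + \dots + \theta_{q}\epsilon_{t-q}$$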

Indeed, the ACF and PACF plots shown earlier suggested that the time series was neither a pure moving average nor a pure autoregressive process, as sinusoidal patterns were found and no clear cut-off between significant and non-significant lags was identified. So, the ARMA model could yield better prediction results (Peixeiro, 2022).

Several orders $p$ and $q$, for the autoregressive and moving average portions of the model, respectively, were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).

Likewise, the function ARIMA from the library statsmodels in Python was used to build the ARMA model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Modeling Assumptions ¶

It is assumed that the present values in the time series are linearly dependent on both their past values and on the mean, the current error term, and the past error terms (Peixeiro, 2022).

Another key assumption is that the time series became stationary after a second-order differencing.

Moreover, it was assumed that the residuals from the model were normally distributed, independent, and uncorrelated, approximating white noise (Peixeiro, 2022).

Test Design ¶

The dataset was split into a training and a testing set, allocating $80\%$ and $20\%$ of the data, respectively.

In [728]:
time_series = y_diff2

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]

Model Building ¶

In this case, as several values for $p$ and $q$ were tested, a function to fit several ARMA models and calculate their corresponding AIC score was defined.

In [729]:
def optimize_arma_model(p: range, q: range, time_series: pd.Series) -> ARIMA:
        """
        Optimize an autoregressive moving average (ARMA) model given a set of p and q values, by minimizing the Akaike Information Criterion (AIC).

        Parameters:
        p (range): Range for order p in the autoregressive portion of the ARMA model.
        q (range): Range for order q in the moving average portion of the ARMA model.
        time_series (pandas.Series): Time series data for fitting the ARMA model.

        Returns:
        arma_model (statsmodels.tsa.arima.model.ARIMA): An ARIMA object fitted with the combination of p and q that minimizes 
        the Akaike Information Criterion (AIC).
        

        """
        # Obtaining the combinations of p and q
        order_list = list(product(p, q))

        # Creating empty lists to store results
        order_results = []
        aic_results = []

        # Fitting models
        for order in order_list:

                arma_model = ARIMA(time_series, order = (order[0], 0, order[1])).fit()
                order_results.append(order)
                aic_results.append(arma_model.aic)
        
        # Converting lists to dataframes
        results = pd.DataFrame({'(p,q)': order_results,
                                'AIC': aic_results                                
                                })        
        # Storing results from the best model
        lowest_aic = results.AIC.min()
        best_model = results.loc[results['AIC'] == lowest_aic, ['(p,q)']].values[0][0]

        # Printing results
        print(f'The best model is (p = {best_model[0]}, q = {best_model[1]}), with an AIC of {lowest_aic:.02f}.\n')         
        print(results)     

        # Fitting best model again
        arma_model = ARIMA(time_series, order = (best_model[0], 0, best_model[1])).fit()

        return arma_model

Then, several ARMA models were fitted using ranges covering the previously identified significant lags plus one, according to the ACF and PACF plots:

  • $p$: From 1 to 5
  • $q$: From 1 to 6
In [730]:
arma_model = optimize_arma_model(p=range(1,6), q=range(1,7), time_series=y_train)
The best model is (p = 4, q = 5), with an AIC of 634.21.

     (p,q)          AIC
0   (1, 1)   689.059291
1   (1, 2)   689.249985
2   (1, 3)   690.762092
3   (1, 4)   690.935897
4   (1, 5)   687.504766
5   (1, 6)   665.427105
6   (2, 1)   710.716053
7   (2, 2)   690.696598
8   (2, 3)   691.233741
9   (2, 4)   692.495432
10  (2, 5)   689.913298
11  (2, 6)   672.768398
12  (3, 1)   638.414862
13  (3, 2)   636.733963
14  (3, 3)   635.917669
15  (3, 4)   634.837998
16  (3, 5)   638.067682
17  (3, 6)   676.598570
18  (4, 1)   635.632338
19  (4, 2)   636.837116
20  (4, 3)   639.042331
21  (4, 4)   637.188384
22  (4, 5)   634.211084
23  (4, 6)  1854.993302
24  (5, 1)   637.444485
25  (5, 2)   637.908936
26  (5, 3)   638.939437
27  (5, 4)   637.940815
28  (5, 5)   634.471567
29  (5, 6)  1754.478023

So, the model ARMA(4, 5) was found to be the best relative to all other models fitted.

In [731]:
arma_model.summary()
Out[731]:
                               SARIMAX Results                                
==============================================================================
Dep. Variable:              Net Sales   No. Observations:                   30
Model:                 ARIMA(4, 0, 5)   Log Likelihood                -306.106
Date:                Fri, 01 Nov 2024   AIC                            634.211
Time:                        17:46:58   BIC                            649.624
Sample:                    07-01-2014   HQIC                           639.142
                         - 10-01-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const       1219.6968   2549.815      0.478      0.632   -3777.849    6217.243
ar.L1         -0.0401      0.270     -0.149      0.882      -0.569       0.489
ar.L2         -0.0458      0.236     -0.194      0.846      -0.508       0.417
ar.L3         -0.0438      0.218     -0.201      0.841      -0.471       0.384
ar.L4          0.9518      0.226      4.211      0.000       0.509       1.395
ma.L1         -0.9058      0.794     -1.141      0.254      -2.461       0.650
ma.L2          0.1759      0.737      0.239      0.811      -1.269       1.621
ma.L3         -0.1859      1.115     -0.167      0.868      -2.371       1.999
ma.L4         -0.5263      0.626     -0.840      0.401      -1.754       0.701
ma.L5          0.6863      0.689      0.996      0.319      -0.665       2.037
sigma2      3.726e+07      0.023   1.59e+09      0.000    3.73e+07    3.73e+07
===================================================================================
Ljung-Box (L1) (Q):                 0.51   Jarque-Bera (JB):                 1.00
Prob(Q):                            0.47   Prob(JB):                         0.61
Heteroskedasticity (H):             1.82   Skew:                             0.43
Prob(H) (two-sided):                0.36   Kurtosis:                         2.73
===================================================================================


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 6.42e+25. Standard errors may be unstable.

Residual Analysis¶

The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).

To do so, the residual analysis is performed in two steps (Peixeiro, 2022):

  • Quantile-quantile plot (Q-Q plot): To qualitatively assess whether the residuals from the model are normally distributed.
  • Ljung-Box test: To test whether the residuals from the model are uncorrelated.
In [732]:
# Storing residuals
residuals = arma_model.resid
In [733]:
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from ARMA model'
arma_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

From the diagnostic plots, the residuals are not perfectly normally distributed. However, the results were considered good enough for the purposes of the present study, as other more sophisticated models were fitted below.

Then, the residuals were assessed by testing for autocorrelation with the Ljung-Box test (Peixeiro, 2022), at a significance level of $0.05$ (i.e., a 95% confidence level):

$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
In [734]:
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
Out[734]:
     lb_stat  lb_pvalue
1   3.271269   0.070503
2   3.401430   0.182553
3   3.420409   0.331233
4   3.519843   0.474868
5   3.689612   0.594911
6   3.763542   0.708639
7   4.246189   0.751025
8   7.457154   0.488205
9   7.726973   0.561878
10  8.526754   0.577525

For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, and no autocorrelation was found in the residuals of the ARMA model. This means that the residuals are independently distributed and the model can be used for forecasting.

Predictions¶

In [735]:
# Getting predictions from ARMA model at a confidence level of 95%.
predictions = arma_model.get_forecast(len(y_test) + 1)
predictions_ci = predictions.conf_int(alpha = 0.05)
predictions_ci
Out[735]:
            lower Net Sales  upper Net Sales
2022-01-01    -73772.743621    -49648.092767
2022-04-01     10567.440730     43664.817097
2022-07-01    -18367.879147     14921.241361
2022-10-01     15502.807909     49082.785082
2023-01-01    -72873.369379    -37432.758688
2023-04-01      8887.030408     44779.177577
2023-07-01    -19407.430067     16633.904872
2023-10-01     14028.399318     50358.751266
In [736]:
# Getting predicted mean from confidence intervals from ARMA model
y_pred_arma_model = arma_model.predict(start = predictions_ci.index[0], end = predictions_ci.index[-1])
y_pred_arma_model
Out[736]:
2022-01-01   -61710.418194
2022-04-01    27116.128914
2022-07-01    -1723.318893
2022-10-01    32292.796496
2023-01-01   -55153.064034
2023-04-01    26833.103992
2023-07-01    -1386.762597
2023-10-01    32193.575292
Freq: QS-OCT, Name: predicted_mean, dtype: float64
In [737]:
# Calculating predictions at the original scale of WALMEX net sales
y_pred_arma_model = y_diff.loc[y_train.index[-1]].values + y_pred_arma_model.cumsum()
y_pred_arma_model = y.loc[y_train.index[-1]] + y_pred_arma_model.cumsum()
y_pred_arma_model
Out[737]:
2022-01-01    189838.581806
2022-04-01    193730.292526
2022-07-01    195898.684352
2022-10-01    230359.872675
2023-01-01    209667.996963
2023-04-01    215809.225244
2023-07-01    220563.690928
2023-10-01    257511.731903
Freq: QS-OCT, Name: predicted_mean, dtype: float64

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the ARMA model.

In [738]:
plot_predictions(y_pred_arma_model, 'ARMA Model')

In view of the plot above, the ARMA model was able to provide very close predictions for the WALMEX net sales.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [739]:
arma_rmse, arma_mae, arma_r2 = calculate_scores(pd.DataFrame(y_pred_arma_model))
net_sales
RMSE: 5006.673
MAE: 4190.297
Coefficient of Determination: 0.942

In [740]:
# Adding scores to lists
rmse_list.append(arma_rmse[0])
mae_list.append(arma_mae[0])
r2_list.append(arma_r2[0])

6.5 Autoregressive Integrated Moving Average (ARIMA) Model ¶

Modeling Technique ¶

An Autoregressive Integrated Moving Average (ARIMA) model based on the WALMEX net sales time series was built.

The ARIMA model combines the autoregressive process, integration, and the moving average process to forecast non-stationary time series, that is, series whose trend and/or variance change over time (Peixeiro, 2022). It is denoted as ARIMA($p$, $d$, $q$); where $p$ is the order of the autoregressive process, $d$ is the integration order, and $q$ is the order of the moving average process. The order $p$ determines the number of past values that affect the present value, the order $q$ determines the number of past error terms that affect the present value, and the integration order $d$ indicates the number of times the series has been differenced to become stationary (Peixeiro, 2022).
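In backshift notation, where $By_{t} = y_{t-1}$, the ARIMA($p$, $d$, $q$) model can be written in its standard form (consistent with Peixeiro, 2022) as:

$$\left(1 - \sum_{i=1}^{p} \phi_{i}B^{i}\right)(1 - B)^{d}\, y_{t} = \left(1 + \sum_{j=1}^{q} \theta_{j}B^{j}\right)\epsilon_{t}$$

where $\phi_{i}$ are the autoregressive coefficients, $\theta_{j}$ are the moving average coefficients, and $\epsilon_{t}$ is a white noise term.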

Similar to the building of the ARMA model, several orders $p$ and $q$, for the autoregressive and moving average portions of the model, respectively, were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).

The function ARIMA from the library statsmodels in Python was used to build the ARIMA model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Modeling Assumptions ¶

It is assumed that the present values in the time series are linearly dependent on its past values, as well as on the mean, the current error term, and the past error terms (Peixeiro, 2022).

It was also assumed that the residuals from the model were normally distributed, independent, and uncorrelated; approximating white noise (Peixeiro, 2022).

On the other hand, from the EDA, it was determined that the order of integration for the WALMEX net sales is $d = 2$.

Test Design ¶

The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [741]:
time_series = y

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]

Model Building ¶

As several values for $p$ and $q$ were tested, a function to fit several ARIMA models and calculate their corresponding AIC score was defined. However, unlike the ARMA function, a new parameter $d$ was included in the ARIMA function.

In [742]:
def optimize_arima_model(p: range, d: int, q: range, time_series: pd.Series) -> ARIMA:
        """
        Optimize an autoregressive integrated moving average (ARIMA) model based on the Akaike Information Criterion (AIC), 
        given a set of p and q values; while keeping the d order constant. 

        Parameters:
        p (range): Range for order p in the autoregressive portion of the ARIMA model.
        d (int): Integration order.
        q (range): Range for order q in the moving average portion of the ARIMA model.
        time_series (pandas.Series): Time series data for fitting the ARIMA model.

        Returns:
        arima_model (statsmodels.tsa.arima.model.ARIMA): An ARIMA object fitted according to the combination of p and q that minimizes 
        the Akaike Information Criterion (AIC).
        

        """
        # Obtaining the combinations of p and q
        order_list = list(product(p, q))

        # Creating empty lists to store results
        order_results = []
        aic_results = []

        # Fitting models
        for order in order_list:

                arima_model = ARIMA(time_series, order = (order[0], d, order[1])).fit()
                order_results.append((order[0], d, order[1]))
                aic_results.append(arima_model.aic)
        
        # Converting lists to dataframes
        results = pd.DataFrame({'(p,d,q)': order_results,
                                'AIC': aic_results                                
                                })        
        # Storing results from the best model
        lowest_aic = results.AIC.min()
        best_model = results.loc[results['AIC'] == lowest_aic, ['(p,d,q)']].values[0][0]

        # Printing results
        print(f'The best model is (p = {best_model[0]}, d = {d}, q = {best_model[2]}), with an AIC of {lowest_aic:.02f}.\n')         
        print(results)     

        # Fitting best model again
        arima_model = ARIMA(time_series, order = (best_model[0], best_model[1], best_model[2])).fit()

        return arima_model

Then, several ARIMA models were fitted using the previously identified significant lags plus one according to the ACF and PACF plots:

  • $p$: From 1 to 5
  • $q$: From 1 to 6
In [743]:
arima_model = optimize_arima_model(p=range(1,6), d=2, q=range(1,7), time_series=y_train)
The best model is (p = 3, d = 2, q = 1), with an AIC of 688.12.

      (p,d,q)          AIC
0   (1, 2, 1)   690.735239
1   (1, 2, 2)   693.351496
2   (1, 2, 3)   695.276194
3   (1, 2, 4)   694.769638
4   (1, 2, 5)   692.507787
5   (1, 2, 6)   696.912015
6   (2, 2, 1)   692.110715
7   (2, 2, 2)   694.979680
8   (2, 2, 3)   696.205633
9   (2, 2, 4)   696.308298
10  (2, 2, 5)   694.424715
11  (2, 2, 6)   714.181864
12  (3, 2, 1)   688.122065
13  (3, 2, 2)   695.939508
14  (3, 2, 3)   698.538911
15  (3, 2, 4)  1068.206667
16  (3, 2, 5)   931.741421
17  (3, 2, 6)  1417.464562
18  (4, 2, 1)   703.977340
19  (4, 2, 2)   787.877988
20  (4, 2, 3)  1001.862630
21  (4, 2, 4)  1056.239218
22  (4, 2, 5)   910.280141
23  (4, 2, 6)  6785.936282
24  (5, 2, 1)   690.665518
25  (5, 2, 2)   826.874334
26  (5, 2, 3)  1034.760249
27  (5, 2, 4)  1109.219729
28  (5, 2, 5)   923.357552
29  (5, 2, 6)  8630.752800

So, the model ARIMA(3, 2, 1) was found to be the best model relative to all the other models fitted.

In [744]:
arima_model.summary()
Out[744]:
SARIMAX Results
Dep. Variable: net_sales No. Observations: 32
Model: ARIMA(3, 2, 1) Log Likelihood -339.061
Date: Fri, 01 Nov 2024 AIC 688.122
Time: 17:47:15 BIC 695.128
Sample: 01-01-2014 HQIC 690.363
- 10-01-2021
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.9986 0.513 -1.948 0.051 -2.004 0.006
ar.L2 -0.8087 0.365 -2.217 0.027 -1.524 -0.094
ar.L3 -0.5638 0.129 -4.376 0.000 -0.816 -0.311
ma.L1 0.0512 0.899 0.057 0.955 -1.711 1.813
sigma2 3.413e+08 5.31e-09 6.42e+16 0.000 3.41e+08 3.41e+08
Ljung-Box (L1) (Q): 0.60 Jarque-Bera (JB): 38.39
Prob(Q): 0.44 Prob(JB): 0.00
Heteroskedasticity (H): 0.28 Skew: -1.54
Prob(H) (two-sided): 0.06 Kurtosis: 7.61


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 5.38e+32. Standard errors may be unstable.

Residual Analysis¶

The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness; in other words, whether the residuals are normally and independently distributed (Peixeiro, 2022).

To do so, the residual analysis is performed in two steps (Peixeiro, 2022):

  • Quantile-quantile plot (Q-Q plot): To qualitatively assess whether the residuals from the model are normally distributed.
  • Ljung-Box test: To test whether the residuals from the model are uncorrelated.
In [745]:
# Storing residuals
residuals = arima_model.resid
In [746]:
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from ARIMA model'
arima_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

From the diagnostic plots, the residuals are not completely normally distributed. However, it was considered that the results were good enough for the purposes of the present study.

Then, the residuals were assessed for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:

$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
In [747]:
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
Out[747]:
lb_stat lb_pvalue
1 0.996298 0.318208
2 1.272624 0.529241
3 1.287834 0.732024
4 2.389698 0.664490
5 2.507422 0.775377
6 2.584535 0.858889
7 2.584620 0.920591
8 3.390625 0.907511
9 3.550224 0.938451
10 3.915916 0.951060

For each of the lags from 1 to 10, the $p\text{-}values$ were well above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the ARIMA model. Hence, the residuals are independently distributed and the model can be used for forecasting.

Predictions¶

In [748]:
# Getting predictions from ARIMA model at a confidence level of 95%.
predictions = arima_model.get_forecast(len(y_test) + 1)
predictions_ci = predictions.conf_int(alpha = 0.05)
predictions_ci
Out[748]:
lower net_sales upper net_sales
2022-01-01 165173.606625 237593.181281
2022-04-01 158677.779068 263820.459112
2022-07-01 149950.171598 288340.797115
2022-10-01 151955.606075 327783.341282
2023-01-01 120735.684089 353721.126484
2023-04-01 106785.634575 390527.087914
2023-07-01 88968.278221 426427.498843
2023-10-01 73308.793181 468528.597704
In [749]:
# Getting predicted mean from confidence intervals from ARIMA model
y_pred_arima_model = arima_model.predict(start=predictions_ci.index[0], end=predictions_ci.index[-1])
y_pred_arima_model
Out[749]:
2022-01-01    201383.393953
2022-04-01    211249.119090
2022-07-01    219145.484356
2022-10-01    239869.473678
2023-01-01    237228.405287
2023-04-01    248656.361244
2023-07-01    257697.888532
2023-10-01    270918.695443
Freq: QS-OCT, Name: predicted_mean, dtype: float64

In this case, unlike the previous models, it was not necessary to bring the predictions back to the original scale of the series, as the ARIMA model handles the inverse differencing itself through its integration order $d$.

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the ARIMA model.

In [750]:
plot_predictions(y_pred_arima_model, 'ARIMA Model')

In view of the plot above, it seems that the ARIMA model was not able to capture the seasonality of the time series. Notably, even the simpler ARMA model yielded better predictions.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [751]:
arima_rmse, arima_mae, arima_r2 = calculate_scores(pd.DataFrame(y_pred_arima_model))
net_sales
RMSE: 27274.028
MAE: 24120.853
Coefficient of Determination: -0.720

In [752]:
# Adding scores to lists
rmse_list.append(arima_rmse[0])
mae_list.append(arima_mae[0])
r2_list.append(arima_r2[0])

6.6 Seasonal Autoregressive Integrated Moving Average (SARIMA) Model ¶

Modeling Technique ¶

A Seasonal Autoregressive Integrated Moving Average (SARIMA) model based on the WALMEX net sales time series was built.

The SARIMA model combines the autoregressive process, integration, and the moving average process to forecast non-stationary time series, while also accounting for seasonal patterns (Peixeiro, 2022). It is denoted as $\text{SARIMA}(p,d,q)(P,D,Q)_{m}$; where $p$ is the order of the autoregressive process, $d$ is the integration order, $q$ is the order of the moving average process, $m$ is the frequency, and $P$, $D$, and $Q$ are the orders of the seasonal autoregressive, integration, and moving average processes, respectively (Peixeiro, 2022).
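In backshift notation, where $By_{t} = y_{t-1}$, the standard multiplicative form of the SARIMA model is:

$$\left(1 - \sum_{i=1}^{p} \phi_{i}B^{i}\right)\left(1 - \sum_{I=1}^{P} \Phi_{I}B^{Im}\right)(1 - B)^{d}(1 - B^{m})^{D}\, y_{t} = \left(1 + \sum_{j=1}^{q} \theta_{j}B^{j}\right)\left(1 + \sum_{J=1}^{Q} \Theta_{J}B^{Jm}\right)\epsilon_{t}$$

where $\Phi_{I}$ and $\Theta_{J}$ are the seasonal autoregressive and moving average coefficients, and $\epsilon_{t}$ is a white noise term.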

Similar to the building of the ARMA and ARIMA models, several orders $p$, $q$, $P$, and $Q$ were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).

The function SARIMAX from the library statsmodels in Python was used to build the SARIMA model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Modeling Assumptions ¶

It is assumed that the present values in the time series are linearly dependent on its past values, as well as on the mean, the current error term, and the past error terms (Peixeiro, 2022).

It was also assumed that the residuals from the model were normally distributed, independent, and uncorrelated; approximating white noise (Peixeiro, 2022).

Furthermore, as the time series is reported on a quarterly basis, the frequency or number of observations per seasonal cycle was defined as $m = 4$. This assumption was also supported by the additive decomposition of the time series carried out above.

Finally, from the EDA, it was found that the original time series was not stationary. In this sense, it was proposed to use the first-differenced time series as a basis for seasonal differencing. So, the order of integration for the WALMEX net sales was $d = 1$ for the SARIMA model.

Test Design ¶

The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [753]:
time_series = y

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]

Seasonal Differencing¶

The first-differenced time series was seasonally differenced using $m = 4$, and its stationarity was tested with the ADF test.

In [754]:
y_season_diff = y_diff.diff(periods=4).dropna()

stationarity_test = adfuller(y_season_diff, autolag='AIC')
print("ADF statistic: ", stationarity_test[0])
print("P-value: ", stationarity_test[1])
ADF statistic:  -3.0522399619417264
P-value:  0.030295538176182014

The ADF statistic is strongly negative and the $p\text{-}value < 0.05$, so the null hypothesis that the time series is not stationary is rejected.

In this context, the order of seasonal integration is $D = 1$.
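Seasonal differencing with period $m$ removes any pattern that repeats every $m$ observations. As a quick illustration with made-up quarterly values (not the WALMEX data), a purely seasonal period-4 pattern vanishes entirely:

```python
import pandas as pd

# A purely seasonal quarterly pattern repeated over 5 years
seasonal = pd.Series([10., 25., 15., 40.] * 5)

# Seasonal differencing with m = 4, the same call used for y_diff above
season_diff = seasonal.diff(periods=4).dropna()

print(season_diff.unique())  # [0.] -- the seasonal pattern is removed
```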

Model Building ¶

As several values for $p$, $q$, $P$, and $Q$ were tested, a function to fit several SARIMA models and calculate their corresponding AIC score was defined.

In [755]:
def optimize_sarima_model(p: range, d: int, q: range, P: range, D: int, Q: range, m: int, time_series: pd.Series) -> SARIMAX:
        """
        Optimize a Seasonal Autoregressive Integrated Moving Average (SARIMA) model based on the Akaike Information Criterion (AIC), 
        given a set of p, q, P, and Q values; while keeping the d and D orders, and the frequency m constant.        

        Parameters:
        p (range): Range for order p in the autoregressive portion of the SARIMA model.
        d (int): Integration order.
        q (range): Range for order q in the moving average portion of the SARIMA model.
        P (range): Range for order P in the seasonal autoregressive portion of the SARIMA model.
        D (int): Seasonal integration order.
        Q (range): Range for order Q in the seasonal moving average portion of the SARIMA model.
        m (int): Number of observations per seasonal cycle.
        time_series (pandas.Series): Time series data for fitting the SARIMA model.

        Returns:
        sarima_model (statsmodels.tsa.statespace.sarimax.SARIMAX): An SARIMAX object fitted according to the combination of p, q, P, 
        and Q values that minimizes the Akaike Information Criterion (AIC).
        

        """
        # Obtaining the combinations of p, q, P, and Q
        order_list = list(product(p, q, P, Q))

        # Creating empty lists to store results
        order_results = []
        aic_results = []

        # Fitting models
        for order in order_list:

                sarima_model = SARIMAX(endog=time_series, 
                                       order = (order[0], d, order[1]),
                                       seasonal_order=(order[2], D, order[3], m),
                                       ).fit(disp=False)
                order_results.append((order[0], d, order[1], order[2], D, order[3], m))
                aic_results.append(sarima_model.aic)
        
        # Converting lists to dataframes
        results = pd.DataFrame({'(p,d,q)(P,D,Q)m': order_results,
                                'AIC': aic_results                                
                                })        
        # Storing results from the best model
        lowest_aic = results.AIC.min()
        best_model = results.loc[results['AIC'] == lowest_aic, ['(p,d,q)(P,D,Q)m']].values[0][0]

        # Printing results
        print(f'The best model is (p = {best_model[0]}, d = {d}, q = {best_model[2]})(P = {best_model[3]}, D = {D}, Q = {best_model[5]})(m = {m}), with an AIC of {lowest_aic:.02f}.\n')         
        print(results)     

        # Fitting best model again
        sarima_model = SARIMAX(endog=time_series, 
                                order = (best_model[0], d, best_model[2]),
                                seasonal_order=(best_model[3], D, best_model[5], m),
                                ).fit(disp=False)

        return sarima_model

Then, several SARIMA models were fitted, testing seasonal orders up to 4 as the time series is reported on a quarterly basis:

  • $p$: From 1 to 3
  • $q$: From 1 to 3
  • $P$: From 1 to 4
  • $Q$: From 1 to 4
In [756]:
sarima_model = optimize_sarima_model(p=range(1,4), 
                                    d=1, 
                                    q=range(1,4), 
                                    P=range(1,5), 
                                    D=1, 
                                    Q=range(1,5), 
                                    m=4, 
                                    time_series=y_train)
The best model is (p = 1, d = 1, q = 1)(P = 1, D = 1, Q = 1)(m = 4), with an AIC of 555.61.

           (p,d,q)(P,D,Q)m         AIC
0    (1, 1, 1, 1, 1, 1, 4)  555.609844
1    (1, 1, 1, 1, 1, 2, 4)  557.206290
2    (1, 1, 1, 1, 1, 3, 4)  560.066934
3    (1, 1, 1, 1, 1, 4, 4)  561.218828
4    (1, 1, 1, 2, 1, 1, 4)  557.223439
..                     ...         ...
139  (3, 1, 3, 3, 1, 4, 4)  565.349865
140  (3, 1, 3, 4, 1, 1, 4)  567.793619
141  (3, 1, 3, 4, 1, 2, 4)  564.148215
142  (3, 1, 3, 4, 1, 3, 4)  565.327303
143  (3, 1, 3, 4, 1, 4, 4)  568.005094

[144 rows x 2 columns]

So, the model $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ was found to be the best model relative to all the other models fitted.

In [757]:
sarima_model.summary()
Out[757]:
SARIMAX Results
Dep. Variable: net_sales No. Observations: 32
Model: SARIMAX(1, 1, 1)x(1, 1, 1, 4) Log Likelihood -272.805
Date: Fri, 01 Nov 2024 AIC 555.610
Time: 17:48:53 BIC 562.089
Sample: 01-01-2014 HQIC 557.536
- 10-01-2021
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.9072 0.202 4.488 0.000 0.511 1.303
ma.L1 -0.9847 0.495 -1.987 0.047 -1.956 -0.014
ar.S.L4 -0.8204 1.819 -0.451 0.652 -4.386 2.745
ma.S.L4 0.8012 1.859 0.431 0.666 -2.842 4.445
sigma2 4.237e+07 1.69e-08 2.51e+15 0.000 4.24e+07 4.24e+07
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 0.34
Prob(Q): 0.94 Prob(JB): 0.84
Heteroskedasticity (H): 3.65 Skew: -0.25
Prob(H) (two-sided): 0.07 Kurtosis: 3.23


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 7.46e+31. Standard errors may be unstable.

Residual Analysis¶

The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness; in other words, whether the residuals are normally and independently distributed (Peixeiro, 2022).

To do so, the residual analysis is performed in two steps (Peixeiro, 2022):

  • Quantile-quantile plot (Q-Q plot): To qualitatively assess whether the residuals from the model are normally distributed.
  • Ljung-Box test: To test whether the residuals from the model are uncorrelated.
In [758]:
# Storing residuals
residuals = sarima_model.resid
In [759]:
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from SARIMA model'
sarima_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

From the diagnostic plots, the residuals are not completely normally distributed. However, it was considered that the results were good enough for the purposes of the present study.

Then, the residuals were assessed for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:

$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
In [760]:
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
Out[760]:
lb_stat lb_pvalue
1 0.296909 0.585827
2 0.297667 0.861713
3 0.518560 0.914795
4 8.331227 0.080171
5 8.331242 0.138900
6 8.345408 0.213874
7 8.346893 0.303001
8 8.347182 0.400312
9 8.355757 0.498730
10 8.571613 0.573183

For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the SARIMA model. Hence, the residuals are independently distributed and the model can be used for forecasting.

Predictions¶

In [761]:
y_pred_sarima_model = sarima_model.get_prediction(start=len(y_train), end=(len(y_train)+len(y_test))).predicted_mean
y_pred_sarima_model
Out[761]:
2022-01-01    188700.036884
2022-04-01    192431.179204
2022-07-01    192886.115990
2022-10-01    231412.118662
2023-01-01    206087.318747
2023-04-01    209966.989976
2023-07-01    210934.267963
2023-10-01    249338.432491
Freq: QS-OCT, Name: predicted_mean, dtype: float64

As in the ARIMA model, it was not necessary to bring the predictions back to the original scale of the series, as the SARIMA model handled that process by itself.

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the SARIMA model.

In [762]:
plot_predictions(y_pred_sarima_model, 'SARIMA Model')

In view of the plot above, the SARIMA model was able to neatly capture both the trend and the seasonality of the time series; this is the best performance obtained so far.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [763]:
sarima_rmse, sarima_mae, sarima_r2 = calculate_scores(pd.DataFrame(y_pred_sarima_model))
net_sales
RMSE: 2675.576
MAE: 2372.531
Coefficient of Determination: 0.983

In [764]:
# Adding scores to lists
rmse_list.append(sarima_rmse[0])
mae_list.append(sarima_mae[0])
r2_list.append(sarima_r2[0])

6.7 Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX) Model ¶

Modeling Technique ¶

A Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX) model based on the WALMEX net sales, units, GDP, stock value, the S&P 500, the IPC, exchange rates, and interest rates time series was built.

The SARIMAX model combines the autoregressive process, integration, and the moving average process to forecast non-stationary time series with seasonal patterns, while also including the effects of external variables (Peixeiro, 2022). It is denoted as $\text{SARIMA}(p,d,q)(P,D,Q)_{m} + \sum_{i=1}^{n} \beta_i X_t^i$; where $p$ is the order of the autoregressive process, $d$ is the integration order, $q$ is the order of the moving average process, $m$ is the frequency; $P$, $D$, and $Q$ are the orders of the seasonal autoregressive, integration, and moving average processes, respectively; and $X_t$ are any number of exogenous variables with their corresponding coefficients $\beta$ (Peixeiro, 2022).

Similar to the building of the ARMA, ARIMA, and SARIMA models, several orders $p$, $q$, $P$, and $Q$ were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).

The function SARIMAX from the library statsmodels in Python was used to build the SARIMAX model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Modeling Assumptions ¶

It is assumed that the present values in the time series are linearly dependent on its past values, as well as on the mean, the current error term, and the past error terms (Peixeiro, 2022).

It was also assumed that the residuals from the model were normally distributed, independent, and uncorrelated; approximating white noise (Peixeiro, 2022).

Furthermore, from the EDA, it was found that the original time series was not stationary. In this sense, it was proposed to use the first-differenced time series as a basis for seasonal differencing. So, the order of integration for the WALMEX net sales was $d = 1$ for the SARIMAX model.

As the time series is reported on a quarterly basis, the frequency or number of observations per seasonal cycle was defined as $m = 4$. This assumption was also supported by the additive decomposition of the time series carried out above.

Finally, from the seasonal differencing carried out above at the SARIMA model, the seasonal order of integration was defined as $D = 1$.

Test Design ¶

The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [765]:
# Endogeneous variable
time_series = y

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]
In [766]:
# Exogeneous variables
exog = X.copy()
exog_train = exog.iloc[:cutoff]
exog_test = exog.iloc[cutoff:]

Model Building ¶

As several values for $p$, $q$, $P$, and $Q$ were tested, a function to fit several SARIMAX models and calculate their corresponding AIC score was defined.

In [767]:
def optimize_sarimax_model(p: range, d: int, q: range, P: range, D: int, Q: range, m: int, endog: pd.Series, exog: pd.DataFrame) -> SARIMAX:
        """
        Optimize a Seasonal Autoregressive Integrated Moving Average with Exogeneous Variables (SARIMAX) model based on the Akaike Information Criterion (AIC), 
        given a set of p, q, P, and Q values; while keeping the d and D orders, and the frequency m constant.

        Parameters:
        p (range): Range for order p in the autoregressive portion of the SARIMA model.
        d (int): Integration order.
        q (range): Range for order q in the moving average portion of the SARIMA model.
        P (range): Range for order P in the seasonal autoregressive portion of the SARIMA model.
        D (int): Seasonal integration order.
        Q (range): Range for order Q in the seasonal moving average portion of the SARIMA model.
        m (int): Number of observations per seasonal cycle.
        endog (pandas.Series): Time series of the endogenous variable for fitting the SARIMAX model.
        exog (pandas.DataFrame): Time series of the exogenous variables for fitting the SARIMAX model.

        Returns:
        sarimax_model (statsmodels.tsa.statespace.sarimax.SARIMAX): An SARIMAX model object fitted according to the combination of p, q, P, 
        and Q values that minimizes the Akaike Information Criterion (AIC).
        

        """
        # Obtaining the combinations of p, q, P, and Q
        order_list = list(product(p, q, P, Q))

        # Creating empty lists to store results
        order_results = []
        aic_results = []

        # Fitting models
        for order in order_list:

                sarimax_model = SARIMAX(endog=endog, 
                                        exog=exog,
                                       order = (order[0], d, order[1]),
                                       seasonal_order=(order[2], D, order[3], m),
                                       ).fit(disp=False)
                order_results.append((order[0], d, order[1], order[2], D, order[3], m))
                aic_results.append(sarimax_model.aic)
        
        # Converting lists to dataframes
        results = pd.DataFrame({'(p,d,q)(P,D,Q)m': order_results,
                                'AIC': aic_results                                
                                })        
        # Storing results from the best model
        lowest_aic = results.AIC.min()
        best_model = results.loc[results['AIC'] == lowest_aic, ['(p,d,q)(P,D,Q)m']].values[0][0]

        # Printing results
        print(f'The best model is (p = {best_model[0]}, d = {d}, q = {best_model[2]})(P = {best_model[3]}, D = {D}, Q = {best_model[5]})(m = {m}), with an AIC of {lowest_aic:.02f}.\n')         
        print(results)     

        # Fitting best model again
        sarimax_model = SARIMAX(endog=endog, 
                                exog=exog, 
                                order = (best_model[0], d, best_model[2]),
                                seasonal_order=(best_model[3], D, best_model[5], m),
                                ).fit(disp=False)

        return sarimax_model

Then, several SARIMAX models were fitted using the following orders:

  • $p$: From 1 to 3
  • $q$: From 1 to 3
  • $P$: From 1 to 2
  • $Q$: From 1 to 2
In [768]:
sarimax_model = optimize_sarimax_model(p=range(1,4), 
                                        d=1, 
                                        q=range(1,4), 
                                        P=range(1,3), 
                                        D=1, 
                                        Q=range(1,3), 
                                        m=4, 
                                        endog=y_train,
                                        exog=exog_train)
The best model is (p = 1, d = 1, q = 2)(P = 2, D = 1, Q = 1)(m = 4), with an AIC of 529.08.

          (p,d,q)(P,D,Q)m         AIC
0   (1, 1, 1, 1, 1, 1, 4)  531.525500
1   (1, 1, 1, 1, 1, 2, 4)  531.456966
2   (1, 1, 1, 2, 1, 1, 4)  529.357752
3   (1, 1, 1, 2, 1, 2, 4)  533.898865
4   (1, 1, 2, 1, 1, 1, 4)  533.679524
5   (1, 1, 2, 1, 1, 2, 4)  532.698779
6   (1, 1, 2, 2, 1, 1, 4)  529.077368
7   (1, 1, 2, 2, 1, 2, 4)  534.696033
8   (1, 1, 3, 1, 1, 1, 4)  536.454624
9   (1, 1, 3, 1, 1, 2, 4)  541.933161
10  (1, 1, 3, 2, 1, 1, 4)  531.658909
11  (1, 1, 3, 2, 1, 2, 4)  535.743332
12  (2, 1, 1, 1, 1, 1, 4)  533.436486
13  (2, 1, 1, 1, 1, 2, 4)  533.594133
14  (2, 1, 1, 2, 1, 1, 4)  531.104424
15  (2, 1, 1, 2, 1, 2, 4)  535.895220
16  (2, 1, 2, 1, 1, 1, 4)  535.425705
17  (2, 1, 2, 1, 1, 2, 4)  534.324703
18  (2, 1, 2, 2, 1, 1, 4)  531.388931
19  (2, 1, 2, 2, 1, 2, 4)  536.739133
20  (2, 1, 3, 1, 1, 1, 4)  655.107508
21  (2, 1, 3, 1, 1, 2, 4)  549.104998
22  (2, 1, 3, 2, 1, 1, 4)  662.450191
23  (2, 1, 3, 2, 1, 2, 4)  542.492387
24  (3, 1, 1, 1, 1, 1, 4)  534.709970
25  (3, 1, 1, 1, 1, 2, 4)  533.885045
26  (3, 1, 1, 2, 1, 1, 4)  531.231438
27  (3, 1, 1, 2, 1, 2, 4)  535.461910
28  (3, 1, 2, 1, 1, 1, 4)  537.236569
29  (3, 1, 2, 1, 1, 2, 4)  538.352605
30  (3, 1, 2, 2, 1, 1, 4)  536.092496
31  (3, 1, 2, 2, 1, 2, 4)  539.216940
32  (3, 1, 3, 1, 1, 1, 4)  539.914609
33  (3, 1, 3, 1, 1, 2, 4)  537.444339
34  (3, 1, 3, 2, 1, 1, 4)  540.687811
35  (3, 1, 3, 2, 1, 2, 4)  538.993444

So, the model $\text{SARIMAX}(1,1,2)(2,1,1)_{4} + \sum_{i=1}^{n} \beta_i X_t^i$ was found to be the best model relative to all the other models fitted.

In [769]:
sarimax_model.summary()
Out[769]:
SARIMAX Results
Dep. Variable: net_sales No. Observations: 32
Model: SARIMAX(1, 1, 2)x(2, 1, [1], 4) Log Likelihood -250.539
Date: Fri, 01 Nov 2024 AIC 529.077
Time: 17:49:31 BIC 547.219
Sample: 01-01-2014 HQIC 534.472
- 10-01-2021
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
units 56.2573 13.751 4.091 0.000 29.306 83.209
gdp 0.0007 0.001 0.586 0.558 -0.002 0.003
walmex 227.9496 330.887 0.689 0.491 -420.576 876.476
sp500 -17.1477 12.304 -1.394 0.163 -41.263 6.968
ipc 0.7143 0.634 1.127 0.260 -0.528 1.957
exchange_rates 666.4456 1582.027 0.421 0.674 -2434.271 3767.162
interest_rates 1893.6433 1850.806 1.023 0.306 -1733.870 5521.156
ar.L1 0.4742 1.535 0.309 0.757 -2.535 3.484
ma.L1 -0.4881 1.507 -0.324 0.746 -3.442 2.466
ma.L2 0.0606 0.281 0.216 0.829 -0.489 0.610
ar.S.L4 -0.7330 0.704 -1.042 0.298 -2.112 0.646
ar.S.L8 -0.1085 0.182 -0.596 0.551 -0.466 0.248
ma.S.L4 0.6216 0.791 0.786 0.432 -0.928 2.172
sigma2 7.662e+06 0.198 3.86e+07 0.000 7.66e+06 7.66e+06
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 0.54
Prob(Q): 0.93 Prob(JB): 0.76
Heteroskedasticity (H): 2.33 Skew: -0.34
Prob(H) (two-sided): 0.22 Kurtosis: 2.89


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.21e+24. Standard errors may be unstable.

Residual Analysis¶

The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).

To do so, the residual analysis is performed in two steps (Peixeiro, 2022):

  • Quantile-quantile plot (Q-Q plot): To qualitatively assess whether the residuals from the model are normally distributed.
  • Ljung-Box test: To test whether the residuals from the model are uncorrelated.
In [770]:
# Storing residuals
residuals = sarimax_model.resid
In [771]:
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from SARIMAX model'
sarimax_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

From the diagnostics plots, the residuals are not completely normally distributed. However, it was considered that the results were good enough for the purposes of the present study.

Then, the residuals were assessed for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:

$$H_{0}: \text{No autocorrelation}$$$$H_{1}: \text{Autocorrelation}$$$$\alpha = 0.05$$
In [772]:
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
Out[772]:
lb_stat lb_pvalue
1 0.508670 0.475715
2 0.518272 0.771718
3 1.513775 0.679095
4 8.654437 0.070342
5 8.655764 0.123608
6 8.676066 0.192632
7 8.676392 0.276732
8 8.681004 0.369916
9 8.683334 0.467004
10 8.696511 0.561128

For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the SARIMAX model. The residuals are therefore independently distributed, and the model can be used for forecasting.

Predictions¶

Unlike the other models, obtaining predictions for the WALMEX net sales series required supplying the actual values of the exogenous variables over the testing time window:

In [773]:
y_pred_sarimax_model = sarimax_model.get_prediction(start=len(y_train), end=(len(y_train)+len(y_test)), exog=exog_test).predicted_mean
y_pred_sarimax_model
Out[773]:
2022-01-01    193454.670515
2022-04-01    206073.324666
2022-07-01    210553.732409
2022-10-01    256600.961779
2023-01-01    233101.893331
2023-04-01    236312.237932
2023-07-01    232991.278696
2023-10-01    273045.912723
Freq: QS-OCT, Name: predicted_mean, dtype: float64

Thus, to forecast WALMEX net sales further into the future, with yet-unknown values for the exogenous variables, those variables would first have to be forecast as well in order to feed the SARIMAX model. This could cause prediction errors to accumulate, rendering the SARIMAX model unreliable (Peixeiro, 2022). It is therefore recommended to use the SARIMAX model to forecast only the next period (Peixeiro, 2022).

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the SARIMAX model.

In [774]:
plot_predictions(y_pred_sarimax_model, 'SARIMAX Model')

In view of the plot above, the SARIMAX model was able to capture the trend and seasonality of the time series. However, its performance was inferior to that of the SARIMA model.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [775]:
sarimax_rmse, sarimax_mae, sarimax_r2 = calculate_scores(pd.DataFrame(y_pred_sarimax_model))
net_sales
RMSE: 19736.315
MAE: 18619.002
Coefficient of Determination: 0.099

In [776]:
# Adding scores to lists
rmse_list.append(sarimax_rmse[0])
mae_list.append(sarimax_mae[0])
r2_list.append(sarimax_r2[0])

6.8 Simple Exponential Smoothing (SES) Model ¶

Modeling Technique ¶

A Simple Exponential Smoothing (SES) model based on the WALMEX net sales was built.

The SES model is a smoothing technique that uses an exponential window function (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). It is useful when a time series exhibits neither trend nor seasonality (Atwan, 2022).

The function SimpleExpSmoothing from the library statsmodels in Python was used to build the SES model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
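The exponential window amounts to the recursion $\ell_{t} = \alpha y_{t} + (1-\alpha)\ell_{t-1}$, with a flat forecast equal to the last level. A minimal sketch (the `ses_forecast` helper is illustrative, not part of statsmodels):

```python
import numpy as np

def ses_forecast(y, alpha, initial_level):
    """Simple exponential smoothing: each new level is a weighted
    average of the latest observation and the previous level."""
    level = initial_level
    for obs in y:
        level = alpha * obs + (1 - alpha) * level
    # SES has no trend or seasonal terms, so every horizon gets the same value
    return level

y = np.array([10.0, 12.0, 11.0, 13.0])
print(ses_forecast(y, alpha=0.5, initial_level=10.0))  # -> 12.0
```

A small `alpha` (such as the optimized $0.005$ reported below) makes the forecast depend almost entirely on the initial level rather than on recent observations.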

Modeling Assumptions ¶

The SES model requires the time series to be stationary. By using the second-order differenced series, it was assumed that the stationarity requirement was fulfilled.

Furthermore, it was assumed that the seasonal patterns were negligible for the purposes of this model.

Test Design ¶

The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [777]:
time_series = y_diff2

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]

Model Building ¶

In [778]:
# Model Fitting
ses_model = SimpleExpSmoothing(y_train).fit()
ses_model.summary()
Out[778]:
SimpleExpSmoothing Model Results
Dep. Variable: Net Sales No. Observations: 30
Model: SimpleExpSmoothing SSE 29041808623.000
Optimized: True AIC 624.724
Trend: None BIC 627.527
Seasonal: None AICC 626.324
Seasonal Periods: None Date: Fri, 01 Nov 2024
Box-Cox: False Time: 17:49:39
Box-Cox Coeff.: None
coeff code optimized
smoothing_level 0.0050000 alpha True
initial_level -828.00000 l.0 False

Predictions¶

In [779]:
y_pred_ses_model = ses_model.forecast(len(y_test) + 1)
y_pred_ses_model
Out[779]:
2022-01-01   -529.793623
2022-04-01   -529.793623
2022-07-01   -529.793623
2022-10-01   -529.793623
2023-01-01   -529.793623
2023-04-01   -529.793623
2023-07-01   -529.793623
2023-10-01   -529.793623
Freq: QS-OCT, dtype: float64
In [780]:
# Detransforming predictions to original series scale
y_pred_ses_model = y_diff.loc[y_train.index[-1]].values + y_pred_ses_model.cumsum()
y_pred_ses_model = y.loc[y_train.index[-1]] + y_pred_ses_model.cumsum()
y_pred_ses_model
Out[780]:
2022-01-01    251019.206377
2022-04-01    288445.619130
2022-07-01    325342.238261
2022-10-01    361709.063768
2023-01-01    397546.095652
2023-04-01    432853.333913
2023-07-01    467630.778551
2023-10-01    501878.429566
Freq: QS-OCT, dtype: float64
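The two cumulative sums above undo the two rounds of differencing: the first restores the first-order differences from the last known difference, and the second restores the original scale from the last known level. A toy sketch of the same inversion (the series values are made up):

```python
import pandas as pd

# Toy series standing in for the notebook's quarterly net sales `y`
s = pd.Series([100.0, 103.0, 108.0, 115.0, 124.0])
d1 = s.diff()    # first-order differences:  [NaN, 3, 5, 7, 9]
d2 = d1.diff()   # second-order differences: [NaN, NaN, 2, 2, 2]

# Suppose a model "forecast" the last two second-order differences
pred_d2 = d2.iloc[3:]                            # [2.0, 2.0]

# Step 1: last known 1st difference + cumulative sum -> 1st differences
restored_d1 = d1.iloc[2] + pred_d2.cumsum()      # [7.0, 9.0]
# Step 2: last known level + cumulative sum -> original scale
restored = s.iloc[2] + restored_d1.cumsum()      # [115.0, 124.0]
print(restored.tolist())
```

The reconstructed values match the original series exactly, confirming that the inversion is the mirror image of the differencing.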

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the SES model.

In [781]:
plot_predictions(y_pred_ses_model, 'SES Model')

In view of the plot above, as expected, the SES model was unable to capture either the trend or the seasonality of the time series.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [782]:
ses_rmse, ses_mae, ses_r2 = calculate_scores(pd.DataFrame(y_pred_ses_model))
net_sales
RMSE: 180107.784
MAE: 166655.346
Coefficient of Determination: -74.013

In [783]:
# Adding scores to lists
rmse_list.append(ses_rmse[0])
mae_list.append(ses_mae[0])
r2_list.append(ses_r2[0])

6.9 Holt-Winters (HW) Model ¶

Modeling Technique ¶

A Holt-Winters (HW) model based on the WALMEX net sales was built.

The HW model is a smoothing technique that uses an exponentially weighted moving average process (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). It is an enhancement over simple exponential smoothing and can be used when a time series exhibits both a trend and seasonality (Atwan, 2022).

The function ExponentialSmoothing from the library statsmodels in Python was used to build the HW model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
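For reference, in its additive form the HW model recursively updates a level $\ell_{t}$, a trend $b_{t}$, and a seasonal component $s_{t}$ with period $m$ (here $m = 4$ for quarterly data); this is the standard textbook formulation, with $\alpha$, $\beta$, and $\gamma$ corresponding to the smoothing_level, smoothing_trend, and smoothing_seasonal parameters in the model summary:

$$\ell_{t} = \alpha (y_{t} - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})$$$$b_{t} = \beta (\ell_{t} - \ell_{t-1}) + (1 - \beta) b_{t-1}$$$$s_{t} = \gamma (y_{t} - \ell_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m}$$

The $h$-step-ahead forecast is then $\hat{y}_{t+h} = \ell_{t} + h b_{t}$ plus the matching seasonal term, which is what allows the model to extrapolate both trend and seasonality.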

Modeling Assumptions ¶

It has been assumed that the time series exhibits both a trend and a seasonal component, and that both can be modeled additively.

Test Design ¶

The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [784]:
time_series = y

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff:]

Model Building ¶

In [785]:
# Model Fitting
hw_model = ExponentialSmoothing(y_train, trend='additive', seasonal='additive', seasonal_periods=4).fit()
hw_model.summary()
Out[785]:
ExponentialSmoothing Model Results
Dep. Variable: net_sales No. Observations: 32
Model: ExponentialSmoothing SSE 554637676.280
Optimized: True AIC 549.379
Trend: Additive BIC 561.105
Seasonal: Additive AICC 559.855
Seasonal Periods: 4 Date: Fri, 01 Nov 2024
Box-Cox: False Time: 17:49:41
Box-Cox Coeff.: None
coeff code optimized
smoothing_level 0.6767857 alpha True
smoothing_trend 0.0001 beta True
smoothing_seasonal 0.1010045 gamma True
initial_level 1.0762e+05 l.0 True
initial_trend 3002.1621 b.0 True
initial_seasons.0 -5954.8672 s.0 True
initial_seasons.1 -6667.7734 s.1 True
initial_seasons.2 -5347.6172 s.2 True
initial_seasons.3 17970.258 s.3 True

Predictions¶

In [786]:
y_pred_hw_model = hw_model.forecast(len(y_test))
y_pred_hw_model
Out[786]:
2022-01-01    187096.303296
2022-04-01    189832.517330
2022-07-01    193853.235303
2022-10-01    221905.834217
2023-01-01    199100.035656
2023-04-01    201836.249690
2023-07-01    205856.967663
2023-10-01    233909.566577
Freq: QS-OCT, dtype: float64

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the HW model.

In [787]:
plot_predictions(y_pred_hw_model, 'HW Model')

In view of the plot above, the HW model performed notably better than the SES model, as it was able to capture both the trend and seasonality of the WALMEX net sales series.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [788]:
hw_rmse, hw_mae, hw_r2 = calculate_scores(pd.DataFrame(y_pred_hw_model))
net_sales
RMSE: 9508.312
MAE: 7645.737
Coefficient of Determination: 0.791

In [789]:
# Adding scores to lists
rmse_list.append(hw_rmse[0])
mae_list.append(hw_mae[0])
r2_list.append(hw_r2[0])

6.10 Prophet Univariate Time Series Model ¶

Modeling Technique ¶

A Univariate Time Series model using Prophet based on the WALMEX net sales was built.

Prophet is an open source package built and maintained by Meta for forecasting at scale. It was released in 2017 (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023), and it is able to fit nonlinear trends and multiple seasonalities (Peixeiro, 2022).

Prophet implements a general additive model where a time series $y(t)$ is modeled as the linear combination of a trend $g(t)$, a seasonal component $s(t)$, holiday effects $h(t)$, and an error term $ϵ_{t}$ (Peixeiro, 2022):

$$y(t) = g(t) + s(t) + h(t) + ϵ_{t}$$

Even though the model does not take any autoregressive process into account, it has the advantage of being very flexible, as "it can accommodate multiple seasonal periods and changing trends [and] it is robust to outliers and missing data" (Peixeiro, 2022).

The function Prophet from the library prophet in Python was used to build the model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Modeling Assumptions ¶

It has been assumed that the net sales series has a non-linear trend, at least one seasonal component, holiday effects, and that the error terms are normally distributed.

Test Design ¶

Firstly, a dataframe with the training set was created with two columns, a date column named ds and a value column named y, as per requirements of the Prophet algorithm.

In [790]:
y_df = pd.DataFrame(y).rename(columns={'net_sales':'y'}).rename_axis('ds').reset_index()
y_df.head()
Out[790]:
ds y
0 2014-01-01 101405
1 2014-04-01 103300
2 2014-07-01 104367
3 2014-10-01 128586
4 2015-01-01 110875

Then, the dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [791]:
time_series = y_df

cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff:]

Model Building ¶

The model built with Prophet was optimized by using cross-validation and hyperparameter tuning.

Specifically, the hyperparameters changepoint_prior_scale and seasonality_prior_scale were tuned.

In [792]:
def optimize_prophet_model(time_series: pd.DataFrame, changepoint_prior_scale: List[float], seasonality_prior_scale: List[float], metric: str) -> Prophet:
        """
        Optimize an univariate time series model with Prophet, given a set of changepoint_prior_scale and 
        seasonality_prior_scale values.

        Parameters:
        time_series (pandas.DataFrame): Time series data in two columns: ds for dates in a date datatype, and y for the series.
        changepoint_prior_scale (List[float]): Values for the changepoint_prior_scale hyperparameter in Prophet.
        seasonality_prior_scale (List[float]): Values for the seasonality_prior_scale hyperparameter in Prophet.
        metric (str): Selected performance metric for optimization: one of 'mse', 'rmse', 'mae', 'mdape', or 'coverage'.
        
        Returns:
        m (prophet.Prophet): A Prophet model optimized over the tested hyperparameter combinations according to the
        indicated metric.
        
        """
        # Obtaining the combinations of hyperparameters
        params = list(product(changepoint_prior_scale, seasonality_prior_scale))

        # Creating an empty list to store performance results
        metric_results = []

        # Defining cutoff dates
        start_cutoff_percentage = 0.5 # 50% of the data will be used for fitting the model
        start_cutoff_index = int(round(len(time_series) * start_cutoff_percentage, 0))
        start_cutoff = time_series.iloc[start_cutoff_index].values[0] 
        end_cutoff = time_series.iloc[-4].values[0] # The last fourth value is taken as the series is reported in a quarterly basis

        cutoffs = pd.date_range(start=start_cutoff, end=end_cutoff, freq='12M')

        # Fitting models
        for param in params:
                m = Prophet(changepoint_prior_scale=param[0], seasonality_prior_scale=param[1])
                m.add_country_holidays(country_name='MX')
                m.fit(time_series)
        
                df_cv = cross_validation(model=m, horizon='365 days', cutoffs=cutoffs)
                df_p = performance_metrics(df_cv, rolling_window=1)
                metric_results.append(df_p[metric].values[0])

        # Converting list to dataframe
        results = pd.DataFrame({'Hyperparameters': params,
                                'Metric': metric_results                                
                                })     
           
        # Storing results from the best model
        best_params = params[np.argmin(metric_results)]

        # Printing results
        print(f'\nThe best model hyperparameters are changepoint_prior_scale = {best_params[0]}, and seasonality_prior_scale = {best_params[1]}.\n')  
        print(results)

        # Fitting best model again
        m = Prophet(changepoint_prior_scale=best_params[0], seasonality_prior_scale=best_params[1])
        m.add_country_holidays(country_name='MX')
        m.fit(time_series);        

        return m
In [793]:
# Defining hyperparameters values
changepoint_prior_scale = [0.001, 0.01, 0.1, 0.5]
seasonality_prior_scale = [0.01, 0.1, 1.0, 10.0]

# Fitting model
prophet_model = optimize_prophet_model(y_train, changepoint_prior_scale, seasonality_prior_scale, 'rmse')
The best model hyperparameters are changepoint_prior_scale = 0.01, and seasonality_prior_scale = 0.1.

   Hyperparameters        Metric
0    (0.001, 0.01)   5984.533579
1     (0.001, 0.1)   5901.415336
2     (0.001, 1.0)   5906.996465
3    (0.001, 10.0)   6122.661240
4     (0.01, 0.01)   5992.136908
5      (0.01, 0.1)   5867.370102
6      (0.01, 1.0)   5884.870818
7     (0.01, 10.0)   6142.011762
8      (0.1, 0.01)   9667.252523
9       (0.1, 0.1)   9482.562980
10      (0.1, 1.0)   9295.895359
11     (0.1, 10.0)   9485.383653
12     (0.5, 0.01)  11788.038040
13      (0.5, 0.1)  10063.604415
14      (0.5, 1.0)   6934.052458
15     (0.5, 10.0)   8676.735767

Predictions¶

In [794]:
# Creating a DataFrame to hold the predictions from Prophet
future = y_df.drop(columns=['y'])
In [795]:
# Generating the forecast using the predict method
forecast = prophet_model.predict(future)
forecast.tail(len(y_test))
Out[795]:
ds trend yhat_lower yhat_upper trend_lower trend_upper Benito Juárez's birthday Benito Juárez's birthday_lower Benito Juárez's birthday_upper Change of Federal Government ... holidays holidays_lower holidays_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
32 2022-01-01 192027.687645 180331.610497 188558.675430 192027.686974 192027.688237 0.0 0.0 0.0 0.0 ... 244.376869 244.376869 244.376869 -7828.252084 -7828.252084 -7828.252084 0.0 0.0 0.0 184443.812430
33 2022-04-01 194618.874577 182892.036929 190970.600413 194618.872478 194618.876548 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 -7860.926269 -7860.926269 -7860.926269 0.0 0.0 0.0 186757.948308
34 2022-07-01 197238.852476 186578.635481 194617.940344 197238.848403 197238.856171 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 -6513.078378 -6513.078378 -6513.078378 0.0 0.0 0.0 190725.774097
35 2022-10-01 199887.621340 215270.147620 223532.054218 199887.614787 199887.627607 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 19668.209751 19668.209751 19668.209751 0.0 0.0 0.0 219555.831091
36 2023-01-01 202536.390204 192515.983707 200789.477281 202536.380435 202536.399257 0.0 0.0 0.0 0.0 ... 244.376869 244.376869 244.376869 -6233.331246 -6233.331246 -6233.331246 0.0 0.0 0.0 196547.435827
37 2023-04-01 205127.577136 192407.299794 200612.164590 205127.564097 205127.589224 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 -8518.680619 -8518.680619 -8518.680619 0.0 0.0 0.0 196608.896518
38 2023-07-01 207747.555035 198278.398364 206503.375948 207747.538387 207747.570458 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 -5437.311670 -5437.311670 -5437.311670 0.0 0.0 0.0 202310.243365
39 2023-10-01 210396.323899 226445.663672 234782.163151 210396.303539 210396.342612 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 20150.067102 20150.067102 20150.067102 0.0 0.0 0.0 230546.391000

8 rows × 43 columns

As can be seen in the table above, the Prophet algorithm provides many details about the fit and its individual components, such as the trend, additive terms, yearly seasonality, and multiplicative terms. The predictions are held in the $yhat$ column.

Moreover, it is possible to assess visually the forecast components of the fit using the plot_components method.

In [796]:
chart_title = 'WALMEX Net Sales Forecast Components from Prophet Model'
prophet_model.plot_components(forecast)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

Thus, according to the chart above, the model has detected strong seasonalities in the WALMEX net sales at the beginning and at the end of the year, which may correspond to several important holidays in Mexico such as the Independence Day and Christmas.

In [797]:
y_pred_prophet_model = (forecast[['ds','yhat']].tail(len(y_test))
                        .rename(columns={'ds':'date', 'yhat':'net_sales'})
                        .set_index('date').loc[:,'net_sales'])
y_pred_prophet_model
Out[797]:
date
2022-01-01    184443.812430
2022-04-01    186757.948308
2022-07-01    190725.774097
2022-10-01    219555.831091
2023-01-01    196547.435827
2023-04-01    196608.896518
2023-07-01    202310.243365
2023-10-01    230546.391000
Name: net_sales, dtype: float64

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the Prophet model.

In [798]:
plot_predictions(y_pred_prophet_model, 'Prophet Model')

In view of the plot above, the Prophet model performed well as it was able to capture both the trend and seasonality of the WALMEX net sales series.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [799]:
prophet_rmse, prophet_mae, prophet_r2 = calculate_scores(pd.DataFrame(y_pred_prophet_model))
net_sales
RMSE: 12323.068
MAE: 10710.708
Coefficient of Determination: 0.649

In [800]:
# Adding scores to lists
rmse_list.append(prophet_rmse[0])
mae_list.append(prophet_mae[0])
r2_list.append(prophet_r2[0])

6.11 Vector Autoregressive (VAR) Model ¶

Modeling Technique ¶

As shown above, the original dataset comprised $40$ historical data points on $7$ features (economic indicators) and $1$ response variable (net sales of WALMEX in millions of MXN). However, based on the results of Johansen's cointegration test, only the units, S&P 500, and WALMEX series shared a common stochastic trend and could therefore be used in a multivariate time series model.

A first model for multivariate time series prediction was built using a Vector Autoregression (VAR) model to forecast the $3$ selected features. A VAR model was selected due to its ability to forecast multiple features easily using the library statsmodels in Python (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

The VAR($p$) model describes the relationship among two or more time series and the impact that their past values have on each other (Peixeiro, 2022). In this sense, the VAR($p$) model is a generalization of the AR model for multiple time series, and the order $p$ determines how many lagged values impact the present value of the different time series (Peixeiro, 2022).
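In general form, a VAR(1) model for two time series $y_{1,t}$ and $y_{2,t}$ can be sketched as:

$$y_{1,t} = c_{1} + \phi_{1,1}\,y_{1,t-1} + \phi_{1,2}\,y_{2,t-1} + \epsilon_{1,t}$$$$y_{2,t} = c_{2} + \phi_{2,1}\,y_{1,t-1} + \phi_{2,2}\,y_{2,t-1} + \epsilon_{2,t}$$

where the off-diagonal coefficients $\phi_{1,2}$ and $\phi_{2,1}$ capture the impact that the past values of each series have on the other.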

Several orders of $p$ were tested, and the best performing model was selected according to the Akaike Information Criterion (AIC). Then, the Granger causality test was applied to determine whether the past values of one time series are statistically significant in forecasting another. Finally, a residual analysis was carried out to assess whether the residuals were close to white noise (Peixeiro, 2022).
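As a reminder, for a model with $k$ estimated parameters and maximized likelihood $\hat{L}$, the AIC is defined as:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

so lower values indicate a better trade-off between goodness of fit and model complexity.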

The function VAR from the library statsmodels in Python was used to build the VAR model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).

Modeling Assumptions ¶

It was assumed that the features are cointegrated, i.e., that the different features share an underlying stochastic trend and that a linear combination of the different features is stationary (Clower, 2020).
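In other words, for two series that are each integrated of order one, $y_{1,t}, y_{2,t} \sim I(1)$, cointegration requires that some linear combination of them be stationary:

$$z_{t} = y_{1,t} - \beta\,y_{2,t} \sim I(0)$$

for some coefficient $\beta$, which is the property examined by the cointegration test applied earlier.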

Moreover, it was assumed that the current values are linearly dependent on their past values (Peixeiro, 2022); and that the different time series were stationary, they were not a random walk, and seasonality effects were not relevant for modeling.

Test Design ¶

The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [801]:
X_model = X[['units', 'sp500', 'walmex']]
X_diff_model = X_diff[['units', 'sp500', 'walmex']]

cutoff = int(0.8*(len(X_diff_model)))

X_train = X_diff_model[:cutoff + 1]
X_test = X_diff_model[cutoff:]

Model Building¶

Later, the VAR model was built using the class VAR from the statsmodels library.

In [806]:
def optimize_var_model(p: List[int], time_series: pd.DataFrame) -> VAR:
        """
        Optimize a Vector Autoregression (VAR) model over a set of candidate lags (p) by minimizing the Akaike Information Criterion (AIC).

        Parameters:
        p (List[int]): Lag values.
        time_series (pandas.DataFrame): Time series data.

        Returns:
        var_model (statsmodels.tsa.vector_ar.var_model.VAR): VAR model object optimized according to the AIC criterion.

        """

        # Creating empty lists to store results

        aic_results = []

        for lag in p:
                var_model = VAR(endog=time_series).fit(maxlags=lag)
                aic_results.append(var_model.aic)
        
        # Converting lists to dataframes
        results = pd.DataFrame({'(p)': p,
                                'AIC': aic_results                                
                                })        
        # Storing results from the best model
        lowest_aic = results.AIC.min()
        best_model = results.loc[results['AIC'] == lowest_aic, ['(p)']].values[0][0]

        # Printing results
        print(f'The best model is (p = {best_model}), with an AIC of {lowest_aic:.02f}.\n')         
        print(results)     

        # Fitting best model again
        var_model = VAR(endog=time_series).fit(maxlags=best_model)

        return var_model               
In [807]:
p = list(range(1,6))

var_model = optimize_var_model(p=p, time_series=X_train)
The best model is (p = 1), with an AIC of 19.86.

   (p)        AIC
0    1  19.864087
1    2  20.150298
2    3  20.270986
3    4  20.398069
4    5  20.817201
In [808]:
var_model.summary()
Out[808]:
  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Fri, 01, Nov, 2024
Time:                     17:59:08
--------------------------------------------------------------------
No. of Equations:         3.00000    BIC:                    20.4192
Nobs:                     31.0000    HQIC:                   20.0450
Log likelihood:          -427.855    FPE:                4.25352e+08
AIC:                      19.8641    Det(Omega_mle):     2.95549e+08
--------------------------------------------------------------------
Results for equation units
============================================================================
               coefficient       std. error           t-stat            prob
----------------------------------------------------------------------------
const            20.022032        12.301704            1.628           0.104
L1.units         -0.164855         0.180112           -0.915           0.360
L1.sp500          0.000551         0.076224            0.007           0.994
L1.walmex         5.863524         3.506584            1.672           0.094
============================================================================

Results for equation sp500
============================================================================
               coefficient       std. error           t-stat            prob
----------------------------------------------------------------------------
const            63.703334        32.214811            1.977           0.048
L1.units          0.215674         0.471664            0.457           0.647
L1.sp500          0.309229         0.199610            1.549           0.121
L1.walmex        -9.867439         9.182789           -1.075           0.283
============================================================================

Results for equation walmex
============================================================================
               coefficient       std. error           t-stat            prob
----------------------------------------------------------------------------
const             0.600297         0.662098            0.907           0.365
L1.units          0.006522         0.009694            0.673           0.501
L1.sp500          0.006112         0.004103            1.490           0.136
L1.walmex         0.014669         0.188730            0.078           0.938
============================================================================

Correlation matrix of residuals
             units     sp500    walmex
units     1.000000  0.129691 -0.082913
sp500     0.129691  1.000000  0.221034
walmex   -0.082913  0.221034  1.000000

Granger Causality Test¶

The Granger causality test was applied to determine whether past values of a time series are statistically significant in forecasting another time series (Peixeiro, 2022). This test is unidirectional, so it has to be applied in both directions for each pair of series.

$$H_{0}: y_{2,t}\text{ does not Granger-cause }y_{1,t}$$$$H_{1}: y_{2,t}\text{ Granger-causes }y_{1,t}$$$$\alpha = 0.05$$
In [809]:
def test_granger_causality(df: pd.DataFrame, lag: int, alpha: float) -> pd.DataFrame:
        """
        Performs the Granger causality tests for each pair of stationary time series, in both directions y2 --> y1 and y1 --> y2.

        Parameters:
        df (pandas.DataFrame): Stationary time series data.
        lag (int): Lag for whose value the test is computed. 
        alpha (float): Desired significance level for the hypothesis test.

        Returns:
        summary (pandas.DataFrame): Summary results from the Granger causality tests.

        """
                
        # Creating empty lists
        y2 = []
        y1 = []
        interpretation = []
        ssr_ftest_pvalues = []
        ssr_chi2test_pvalues = []
        lrtest_pvalues = []
        params_ftest_pvalues = []
        gc = []

        # Iterating over series bidirectionally
        arrays = list(combinations_with_replacement(df.columns, 2))

        for array in arrays:

                if array[1] == array[0]:
                        continue
                
                # y2 --> y1
                print(f'{array[1]} --> {array[0]}')
                result = grangercausalitytests(df[[array[0],array[1]]], maxlag=[lag])
                ssr_ftest_pvalue = result[lag][0]['ssr_ftest'][1]
                ssr_chi2test_pvalue = result[lag][0]['ssr_chi2test'][1]
                lrtest_pvalue = result[lag][0]['lrtest'][1]
                params_ftest_pvalue = result[lag][0]['params_ftest'][1]

                y2.append(array[1])
                y1.append(array[0])

                ssr_ftest_pvalues.append(round(ssr_ftest_pvalue,4))
                ssr_chi2test_pvalues.append(round(ssr_chi2test_pvalue, 4))
                lrtest_pvalues.append(round(lrtest_pvalue, 4))
                params_ftest_pvalues.append(round(params_ftest_pvalue, 4))

                # Test interpretation
                if ssr_ftest_pvalue <= alpha and ssr_chi2test_pvalue <= alpha and lrtest_pvalue <= alpha and params_ftest_pvalue <= alpha:
                        result = f'{array[1]} Granger-causes {array[0]}.'
                        print(f'\n{result}\n\n')
                        interpretation.append(result)
                        gc.append('True')
                
                elif ssr_ftest_pvalue > alpha and ssr_chi2test_pvalue > alpha and lrtest_pvalue > alpha and params_ftest_pvalue > alpha:
                        result = f'{array[1]} does not Granger-cause {array[0]}.'
                        print(f'\n{result}\n\n')
                        interpretation.append(result)
                        gc.append('False')

                else:
                        result = f'Inconsistent results among the tests.'
                        print(f'\n{result}\n\n')
                        interpretation.append(result)
                        gc.append('Uncertain')
                

                # y1 --> y2
                print(f'{array[0]} --> {array[1]}')
                result = grangercausalitytests(df[[array[1],array[0]]], maxlag=[lag])
                ssr_ftest_pvalue = result[lag][0]['ssr_ftest'][1]
                ssr_chi2test_pvalue = result[lag][0]['ssr_chi2test'][1]
                lrtest_pvalue = result[lag][0]['lrtest'][1]
                params_ftest_pvalue = result[lag][0]['params_ftest'][1]

                y2.append(array[0])
                y1.append(array[1])

                ssr_ftest_pvalues.append(round(ssr_ftest_pvalue, 4))
                ssr_chi2test_pvalues.append(round(ssr_chi2test_pvalue, 4))
                lrtest_pvalues.append(round(lrtest_pvalue, 4))
                params_ftest_pvalues.append(round(params_ftest_pvalue, 4))

                # Test interpretation
                if ssr_ftest_pvalue <= alpha and ssr_chi2test_pvalue <= alpha and lrtest_pvalue <= alpha and params_ftest_pvalue <= alpha:
                        result = f'{array[0]} Granger-causes {array[1]}.'
                        print(f'\n{result}\n\n')
                        interpretation.append(result)
                        gc.append('True')
                
                elif ssr_ftest_pvalue > alpha and ssr_chi2test_pvalue > alpha and lrtest_pvalue > alpha and params_ftest_pvalue > alpha:
                        result = f'{array[0]} does not Granger-cause {array[1]}.'
                        print(f'\n{result}\n\n')
                        interpretation.append(result)
                        gc.append('False')

                else:
                        result = f'Inconsistent results among the tests.'
                        print(f'\n{result}\n\n')
                        interpretation.append(result)
                        gc.append('Uncertain')

        # Building dataframe with summary results
        results_dict = {'y2':y2, 'y1':y1, 'Granger Causality':gc, 
                        'Test Interpretation':interpretation,
                        'SSR F-test p-values': ssr_ftest_pvalues,
                        'SSR Chi2-test p-values':ssr_chi2test_pvalues,
                        'LR-test p-values': lrtest_pvalues,
                        'Params F-test p-values': params_ftest_pvalues
                        }

        summary = pd.DataFrame(results_dict)

        return summary
In [810]:
granger_causality_summary = test_granger_causality(df=X_diff_model, lag=1, alpha=0.05)
granger_causality_summary
sp500 --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.3416  , p=0.5589  , df=1
likelihood ratio test: chi2=0.3401  , p=0.5598  , df=1
parameter F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1

sp500 does not Granger-cause units.


units --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.5346  , p=0.4647  , df=1
likelihood ratio test: chi2=0.5309  , p=0.4662  , df=1
parameter F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1

units does not Granger-cause sp500.


walmex --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.5277  , p=0.2165  , df=1
likelihood ratio test: chi2=1.4978  , p=0.2210  , df=1
parameter F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1

walmex does not Granger-cause units.


units --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.7878  , p=0.3748  , df=1
likelihood ratio test: chi2=0.7797  , p=0.3772  , df=1
parameter F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1

units does not Granger-cause walmex.


walmex --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.0988  , p=0.2945  , df=1
likelihood ratio test: chi2=1.0832  , p=0.2980  , df=1
parameter F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1

walmex does not Granger-cause sp500.


sp500 --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.7710  , p=0.1833  , df=1
likelihood ratio test: chi2=1.7310  , p=0.1883  , df=1
parameter F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1

sp500 does not Granger-cause walmex.


Out[810]:
y2 y1 Granger Causality Test Interpretation SSR F-test p-values SSR Chi2-test p-values LR-test p-values Params F-test p-values
0 sp500 units False sp500 does not Granger-cause units. 0.5784 0.5589 0.5598 0.5784
1 units sp500 False units does not Granger-cause sp500. 0.4875 0.4647 0.4662 0.4875
2 walmex units False walmex does not Granger-cause units. 0.2435 0.2165 0.2210 0.2435
3 units walmex False units does not Granger-cause walmex. 0.4001 0.3748 0.3772 0.4001
4 walmex sp500 False walmex does not Granger-cause sp500. 0.3213 0.2945 0.2980 0.3213
5 sp500 walmex False sp500 does not Granger-cause walmex. 0.2099 0.1833 0.1883 0.2099

Although the results from the Granger causality tests indicated that the null hypothesis (that the series do not Granger-cause one another) could not be rejected, which undermines the validity of the VAR model, it was decided to proceed with the model nonetheless.

Residual Analysis¶

The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).

To do so, the residual analysis is performed in two steps (Peixeiro, 2022):

  • Quantile-quantile plot (Q-Q plot): To qualitatively assess whether the residuals from the model are normally distributed.
  • Ljung-Box test: To test whether the residuals from the model are uncorrelated.
In [811]:
# Storing residuals
sc_X = StandardScaler()
residuals = sc_X.fit_transform(var_model.resid)
In [812]:
base_chart_title = 'Q-Q plot for standardized residuals for '

cols = X_model.columns.values

# Drawing one Q-Q plot per series
for i in range(len(cols)):
    fig, ax = plt.subplots(figsize=(8, 5))
    sm.qqplot(residuals[:, i], line='45', ax=ax)
    chart_title = base_chart_title + cols[i] + ' series from VAR model'
    plt.title(chart_title)
    plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches='tight')
    plt.show()
[Figures: Q-Q plots of the standardized residuals for the units, sp500, and walmex series from the VAR model]

From the plots above, it is clear that the standardized residuals for the units and S&P 500 series did not follow a normal distribution, which means that the VAR model failed to capture some information in the data. Despite this result, the analysis continued below.

Then, the residuals were tested for autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:

$$H_{0}: \text{No autocorrelation}$$$$H_{1}: \text{Autocorrelation}$$$$\alpha = 0.05$$
In [813]:
# Ljung-Box test of the residuals on 10 lags
cols = X_model.columns.values

for i in range(0, len(cols)):
    print(cols[i])
    print(acorr_ljungbox(residuals[:,i], np.arange(1, 11)))
    print("\n")
units
     lb_stat  lb_pvalue
1   0.031080   0.860061
2   1.095988   0.578108
3   1.757447   0.624238
4   3.141998   0.534350
5   4.402226   0.493071
6   4.764572   0.574344
7   5.555494   0.592499
8   7.817132   0.451534
9   8.552430   0.479570
10  8.901256   0.541500


sp500
     lb_stat  lb_pvalue
1   0.201141   0.653801
2   0.878931   0.644381
3   0.918750   0.820901
4   2.046825   0.727147
5   2.403074   0.791016
6   3.670298   0.721191
7   3.944434   0.786155
8   5.081352   0.748847
9   5.760660   0.763613
10  6.624347   0.760369


walmex
     lb_stat  lb_pvalue
1   0.296180   0.586287
2   1.978856   0.371789
3   2.822988   0.419730
4   3.989120   0.407480
5   4.279314   0.509941
6   6.838359   0.336055
7   7.050704   0.423620
8   9.247222   0.321872
9   9.670043   0.377851
10  9.824165   0.456053


For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$, so the null hypothesis cannot be rejected: no autocorrelation was found in the residuals of the VAR model. Hence, the residuals are independently distributed and the model can be used for forecasting.

Predictions¶

The model was then used to forecast the features in the dataset, using the length of the testing set.

In [814]:
# Forecasting
X_pred_var_model = var_model.forecast(var_model.endog, steps=len(X_test))
X_pred_var_model
Out[814]:
array([[25.9666079 , 10.59689735, -0.14430718],
       [14.90100107, 74.00447175,  0.83230058],
       [22.48650828, 81.58875412,  1.16197562],
       [23.17323618, 82.31698421,  1.26263712],
       [23.65065798, 81.69701243,  1.2730433 ],
       [23.63262811, 81.50558432,  1.27252074],
       [23.63243097, 81.44765688,  1.27122556],
       [23.62483721, 81.44248167,  1.27085125]])
In [815]:
# Bringing predictions to original scale
for i in range(0, len(X_model.columns)):
    X_pred_var_model[:,i] = X_model[:cutoff+1].values[-1,i] + X_pred_var_model[:,i].cumsum()
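The rescaling step above relies on the identity that the last observed level plus the cumulative sum of the predicted first differences recovers the forecast on the original scale. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical observed levels and predicted first differences
levels = np.array([100.0, 103.0, 101.0])   # last training observation is 101.0
pred_diffs = np.array([2.0, -1.0, 4.0])    # model output on the differenced scale

# Undifferencing: last level + cumulative sum of predicted differences
pred_levels = levels[-1] + pred_diffs.cumsum()
print(pred_levels)  # [103. 102. 106.]
```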
In [816]:
# Converting array of predictions into a dataframe
features_names = X_model.columns
index_labels = X_diff_model[cutoff:].index
X_pred_var_model = pd.DataFrame(data=X_pred_var_model, index=index_labels, columns=features_names)
X_pred_var_model
Out[816]:
units sp500 walmex
date
2022-01-01 3645.966608 4610.943987 72.843919
2022-04-01 3660.867609 4684.948458 73.676219
2022-07-01 3683.354117 4766.537212 74.838195
2022-10-01 3706.527353 4848.854197 76.100832
2023-01-01 3730.178011 4930.551209 77.373875
2023-04-01 3753.810640 5012.056793 78.646396
2023-07-01 3777.443070 5093.504450 79.917621
2023-10-01 3801.067908 5174.946932 81.188473

Model Assessment¶

The predictions from the VAR model were plotted against the time series to visualize their accuracy.

In [817]:
def plot_multivariate_predictions_in_single_chart(train: pd.DataFrame, test: pd.DataFrame, predictions: pd.DataFrame, plot_title: str, plot_size: Tuple[int, int] = (8,7)) -> None:
        """
        Plots the training, testing and predictions sets for multivariate time series in a single visual.

        Parameters:
        train (pandas.DataFrame): Training set for the multiple time series in their original scale.
        test (pandas.DataFrame): Testing set for the multiple time series in their original scale.
        predictions (pandas.DataFrame): Prediction set for the multiple time series in their original scale.
        plot_title (str): Title for saving the plot
        plot_size (Tuple[int, int]):  Tuple for the figure size. (8,7) by default.

        Returns:
        None
        """ 

        # Colors
        time_series_color=sns.color_palette('Blues_r')[0]
        test_color = 'green'
        pred_color = '#C70039'

        # Data
        cols = train.columns.values

        # Fig overall size
        plt.figure(figsize=plot_size)

        # Loop
        i = 1
        for col in cols:    
            # Plots
            plt.subplot(len(cols), 1, i)        
            ax = sns.lineplot(train[col], color=time_series_color, lw=1, zorder=2)
            sns.lineplot(test[col], color=test_color, lw=1, linestyle='dashed', zorder=1)
            sns.lineplot(predictions[col], color=pred_color, lw=2, zorder=3)

            # Adding shadow to predictions
            first_pred_x = predictions.index[0]
            last_pred_x = predictions.index[-1]
            ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)    

            # Removing grid
            plt.grid(False)

            # Title
            plt.title(col, y=0.8, loc='left')    

            # Removing labels and ticks
            plt.ylabel('')
            plt.xlabel('')
            #plt.yticks([])
            #plt.xticks([])
            i += 1

        # Adding a legend
        legend_x = 1.12 
        legend_y = 1.15 * len(cols)
        legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
                        Line2D([0], [0], color=test_color, lw=2, linestyle='dashed'),
                        Line2D([0], [0], color=pred_color, lw=2)]
        plt.legend(legend_lines, 
                   ['Training', 'Testing', 'Predictions'], 
                   loc='upper center', 
                   bbox_to_anchor=(legend_x, legend_y),
                   facecolor='#ececec',
                   frameon = True)


        # Adjusting ticks in plot
        # max_xticks = 12
        # xloc = plt.MaxNLocator(max_xticks)
        # ax.xaxis.set_major_locator(xloc)
        # plt.xticks(rotation=45)

        # Adding a x label
        plt.xlabel('Date')

        title = "Predictions from " + plot_title + " vs. Historical Time Series"
        plt.savefig(f'Images/fig_{title.lower().replace(" ", "_").replace(".", "")}.png',  bbox_inches = 'tight')
        plt.show()
In [818]:
plot_multivariate_predictions_in_single_chart(train=X_model[:cutoff + 2], 
                                            test=X_model[cutoff + 1:], 
                                            predictions=X_pred_var_model,
                                            plot_title='VAR Model')
[Figure: predictions from the VAR model vs. historical time series]
In [819]:
def plot_multivariate_predictions_in_multiple_charts(train: pd.DataFrame, test: pd.DataFrame, predictions: pd.DataFrame, plot_size: Tuple[int, int] = (8,5)) -> None:
        """
        Plots the time series along with its correspondent predictions in an individual chart each.
        Train, test and prediction parameters must have the same columns.      

        Parameters:        
        train (pandas.DataFrame): Pandas dataframe with the training portion of the time series.
        test (pandas.DataFrame): Pandas dataframe with the testing portion of the time series.
        predictions (pandas.DataFrame): Pandas dataframe with the predictions for the time series.
        plot_size (Tuple[int, int]): Tuple for the figure size. (8,5) by default.

        Returns:
        None

        """
        columns = list(train.columns)
        train_color=sns.color_palette('Blues_r')[0]
        test_color='green'
        pred_color = '#C70039'

        for col in columns:

            # Drawing plots  
            fig, ax = plt.subplots(figsize=plot_size)
            sns.lineplot(train[col], ax=ax, color=train_color, lw=1)
            sns.lineplot(test[col], ax=ax, color=test_color, lw=1, linestyle='dashed')
            sns.lineplot(predictions[col], ax=ax, color=pred_color, lw=2)

            # Adding shadow to predictions
            first_pred_x = test.index[0]
            last_pred_x = test.index[-1]
            ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)    

            # Adding labels to plot
            label = str(col).title().replace('_',' ')
            chart_title = f'Predicted {label} Over Time'
            plt.title(chart_title)
            plt.ylabel(f'{label}')
            plt.xlabel('Time')      

            # Adding legend to plot
            legend_lines = [Line2D([0], [0], color=train_color, lw=2),
                            Line2D([0], [0], color=test_color, lw=2, linestyle='dashed'),                            
                            Line2D([0], [0], color=pred_color, lw=2)]
            plt.legend(legend_lines, ['Training', 'Testing', 'Predictions'], loc='upper left', facecolor='white', frameon=True)

            plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
            plt.show()
In [820]:
plot_multivariate_predictions_in_multiple_charts(train=X_model[:cutoff + 2], 
                                                test=X_model[cutoff + 1:], 
                                                predictions=X_pred_var_model)
[Figures: predicted units, sp500, and walmex over time from the VAR model]

In view of the plots above, it can be seen that the VAR model was not able to yield accurate predictions. The only economic indicator whose predictions were reasonably accurate was units. However, it is important to bear in mind that the COVID pandemic disrupted historical trends not only in Mexico but worldwide, so the VAR model could hardly be expected to provide accurate predictions for the years 2021-2023 using data from 2014-2020 for training.

The RMSE, MAE, and $\bf{r^{2}}$ score were calculated for each feature as follows:

In [821]:
var_rmse, var_mae, var_r2 = calculate_scores(predictions=X_pred_var_model, actuals=X_model[cutoff + 1:])
units
RMSE: 41.684
MAE: 30.270
Coefficient of Determination: 0.750

sp500
RMSE: 741.698
MAE: 699.704
Coefficient of Determination: -9.352

walmex
RMSE: 7.841
MAE: 6.134
Coefficient of Determination: -7.682

In [822]:
# Adding scores to lists
X_rmse_list.append(var_rmse)
X_mae_list.append(var_mae)
X_r2_list.append(var_r2)

6.12 Vector Autoregressive Moving Average (VARMA) Model ¶

Modeling Technique ¶

As shown above, the original dataset comprised $40$ historical data points for $7$ features (economic indicators) and $1$ response variable (net sales of WALMEX in millions of MXN). However, based on the results from Johansen's cointegration test, only the series units, S&P 500, and WALMEX shared the same underlying stochastic trend and could thus be used in a multivariate time series model.

In this sense, a second model for multivariate time series prediction was built using a Vector Autoregressive Moving Average (VARMA) model to forecast the $3$ selected features. A VARMA model was selected as a generalization of the VAR model that includes a moving average process (Peixeiro, 2022).

The VARMA($p$, $q$) model describes the relationship among two or more time series and the impact that their past values and past error terms have on each other (Peixeiro, 2022). Similar to the VAR model, the VARMA($p$, $q$) model is a generalization of the ARMA model for multiple time series, where the order $p$ determines how many lagged values impact the present value of the different time series, and the order $q$ determines the number of past error terms that affect their present values.
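For instance, in vector notation, a VARMA(1, 1) model for the vector of series $\mathbf{y}_{t}$ can be sketched as:

$$\mathbf{y}_{t} = \mathbf{c} + \Phi_{1}\,\mathbf{y}_{t-1} + \Theta_{1}\,\boldsymbol{\epsilon}_{t-1} + \boldsymbol{\epsilon}_{t}$$

where $\Phi_{1}$ carries the autoregressive part and $\Theta_{1}$ the moving average part.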

Several orders of $p$ and $q$ were tested, and the best performing model was selected according to the Akaike Information Criterion (AIC). Then, the Granger causality test was applied to determine whether the past values of one time series are statistically significant in forecasting another. Finally, a residual analysis was carried out to assess whether the residuals were normally and independently distributed (Peixeiro, 2022).

The function VARMAX from the library statsmodels in Python was used to build the VARMA model (Peixeiro, 2022).

Modeling Assumptions ¶

It was assumed that the features are cointegrated, i.e., that the different features share an underlying stochastic trend and that a linear combination of the different features is stationary (Clower, 2020).

Moreover, it was assumed that the current values are linearly dependent on their past values and past error terms (Peixeiro, 2022); and that the different time series were stationary, they were not a random walk, and seasonality effects were not relevant for modeling.

Test Design ¶

The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [823]:
selected_features = ['units', 'sp500', 'walmex']

X_model = X[selected_features]
X_diff_model = X_diff[selected_features]

cutoff = int(0.8*(len(X_diff_model)))

X_train = X_diff_model[:cutoff + 1]
X_test = X_diff_model[cutoff:]

Model Building¶

Later, the VARMA model was built using the class VARMAX from the statsmodels library.

In [824]:
def optimize_varma_model(p: List[int], q: List[int], time_series: pd.DataFrame) -> VARMAX:
        """
        Optimize a Vector Autoregressive Moving Average (VARMA) model by minimizing the Akaike Information Criterion (AIC),
        based on the provided combinations of p and q.

        Parameters:
        p (List[int]): Orders for the autoregressive process of the model.
        q (List[int]): Orders for the moving average process of the model.
        time_series (pandas.DataFrame): Time series data for fitting the model.

        Returns:
        varma_model (statsmodels.tsa.statespace.varmax.VARMAX): A VARMAX object fitted according to the combination of p and q that minimizes 
        the Akaike Information Criterion (AIC).

        """
        
        # Obtaining the combinations of p and q
        order_list = list(product(p, q))

        # Creating empty lists to store results
        order_results = []
        aic_results = []

        # Fitting models
        for order in order_list:

                varma_model = VARMAX(endog=time_series, order=(order[0], order[1])).fit()
                order_results.append(order)
                aic_results.append(varma_model.aic)
        
        # Converting lists to dataframes
        results = pd.DataFrame({'(p,q)': order_results,
                                'AIC': aic_results                                
                                })        
        
        # Storing results from the best model
        lowest_aic = results.AIC.min()
        best_model = results.loc[results['AIC'] == lowest_aic, ['(p,q)']].values[0][0]

        # Printing results
        print(f'The best model is (p = {best_model[0]}, q = {best_model[1]}), with an AIC of {lowest_aic:.02f}.\n')         
        print(results)     

        # Fitting best model again
        varma_model = VARMAX(endog=time_series, order=(best_model[0], best_model[1])).fit()

        return varma_model     
In [825]:
p_list = [1,2,3,4]
q_list = [1,2,3,4]

varma_model = optimize_varma_model(p=p_list, q=q_list, time_series=X_train)
The best model is (p = 1, q = 1), with an AIC of 929.44.

     (p,q)         AIC
0   (1, 1)  929.443620
1   (1, 2)  937.121768
2   (1, 3)  949.256778
3   (1, 4)  953.577824
4   (2, 1)  939.206120
5   (2, 2)  954.181567
6   (2, 3)  954.246749
7   (2, 4)  963.524345
8   (3, 1)  945.146871
9   (3, 2)  957.571176
10  (3, 3)  957.117134
11  (3, 4)  964.511917
12  (4, 1)  952.156263
13  (4, 2)  952.746762
14  (4, 3)  967.041897
15  (4, 4)  980.652654

So, the best model according to the AIC corresponds to VARMA($1$, $1$).
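For reference, the AIC balances goodness of fit against model complexity: for a model with $k$ estimated parameters and maximized likelihood $\hat{L}$, lower values indicate a better trade-off.

$$AIC = 2k - 2\ln(\hat{L})$$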

Granger Causality Test¶

The Granger causality test was applied to determine whether past values of a time series are statistically significant in forecasting another time series (Peixeiro, 2022). This test is unidirectional, so it has to be applied in both directions for each pair of series.

$$H_{0}: y_{2,t}\text{ does not Granger-cause }y_{1,t}$$$$H_{1}: y_{2,t}\text{ Granger-causes }y_{1,t}$$$$\alpha = 0.05$$
In [826]:
gc_summary = test_granger_causality(df=X_diff_model, lag=1, alpha=0.05)
gc_summary
sp500 --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.3416  , p=0.5589  , df=1
likelihood ratio test: chi2=0.3401  , p=0.5598  , df=1
parameter F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1

sp500 does not Granger-causes units.


units --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.5346  , p=0.4647  , df=1
likelihood ratio test: chi2=0.5309  , p=0.4662  , df=1
parameter F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1

units does not Granger-causes sp500.


walmex --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.5277  , p=0.2165  , df=1
likelihood ratio test: chi2=1.4978  , p=0.2210  , df=1
parameter F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1

walmex does not Granger-causes units.


units --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.7878  , p=0.3748  , df=1
likelihood ratio test: chi2=0.7797  , p=0.3772  , df=1
parameter F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1

units does not Granger-causes walmex.


walmex --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.0988  , p=0.2945  , df=1
likelihood ratio test: chi2=1.0832  , p=0.2980  , df=1
parameter F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1

walmex does not Granger-causes sp500.


sp500 --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.7710  , p=0.1833  , df=1
likelihood ratio test: chi2=1.7310  , p=0.1883  , df=1
parameter F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1

sp500 does not Granger-causes walmex.


Out[826]:
y2 y1 Granger Causality Test Interpretation SSR F-test p-values SSR Chi2-test p-values LR-test p-values Params F-test p-values
0 sp500 units False sp500 does not Granger-causes units. 0.5784 0.5589 0.5598 0.5784
1 units sp500 False units does not Granger-causes sp500. 0.4875 0.4647 0.4662 0.4875
2 walmex units False walmex does not Granger-causes units. 0.2435 0.2165 0.2210 0.2435
3 units walmex False units does not Granger-causes walmex. 0.4001 0.3748 0.3772 0.4001
4 walmex sp500 False walmex does not Granger-causes sp500. 0.3213 0.2945 0.2980 0.3213
5 sp500 walmex False sp500 does not Granger-causes walmex. 0.2099 0.1833 0.1883 0.2099

Even though the results from the Granger causality tests suggested that the null hypothesis cannot be rejected, i.e., that the series do not Granger-cause each other, which would render the VARMA model invalid, it was decided to go further with the model.

Residual Analysis¶

The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).

To do so, the residual analysis is performed in two steps (Peixeiro, 2022):

  • Quantile-quantile plot (Q-Q plot): To qualitatively assess whether the residuals from the model are normally distributed.
  • Ljung-Box test: To test whether the residuals from the model are uncorrelated.
In [827]:
# Storing residuals
sc_X = StandardScaler()
residuals = sc_X.fit_transform(varma_model.resid)
In [828]:
# Plotting QQ-Plot for units
chart_title = 'Q-Q plot for standardized residuals for unit series from VARMA model'
varma_model.plot_diagnostics(figsize=(10, 8), variable=0)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

From the plot above, it seems that the standardized residuals for the units series have no trend and constant variance. The histogram also resembles a normal distribution. Moreover, the Q-Q plot shows a loosely straight line, although some curvature at the extremities can be seen. Finally, the correlogram shows no significant coefficients. Thus, it is possible to conclude that the residuals are somewhat close to white noise.

In [829]:
# Plotting QQ-Plot for S&P 500
chart_title = 'Q-Q plot for standardized residuals for S&P 500 series from VARMA model'
varma_model.plot_diagnostics(figsize=(10, 8), variable=1)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

From the plot above, it seems that the standardized residuals for the S&P 500 series have no trend and constant variance. The histogram also fairly resembles a normal distribution. Moreover, the Q-Q plot shows a fairly straight line despite some visible curvature at the extremities. Finally, the correlogram shows no significant coefficients. Thus, it is possible to conclude that the residuals are close to white noise.

In [830]:
# Plotting QQ-Plot for WALMEX
chart_title = 'Q-Q plot for standardized residuals for WALMEX series from VARMA model'
varma_model.plot_diagnostics(figsize=(10, 8), variable=2)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

From the plot above, it seems that the standardized residuals for the WALMEX series have no trend and constant variance. The histogram also fairly resembles a normal distribution. Moreover, the Q-Q plot shows a fairly straight line. Finally, the correlogram shows no significant coefficients. Thus, it is possible to conclude that the residuals are close to white noise.

Finally, the residuals were assessed by testing for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:

$$H_{0}: \text{No autocorrelation}$$$$H_{1}: \text{Autocorrelation}$$$$\alpha = 0.05$$
In [831]:
# Ljung-Box test of the residuals on 10 lags
cols = X_model.columns.values

for i in range(0, len(cols)):
    print(cols[i])
    print(acorr_ljungbox(residuals[:,i], np.arange(1, 11)))
    print("\n")
units
      lb_stat  lb_pvalue
1    0.039994   0.841493
2    1.689027   0.429766
3    2.133470   0.545171
4    3.062334   0.547449
5    4.723216   0.450584
6    5.216527   0.516357
7    6.088058   0.529506
8    9.093956   0.334432
9    9.647166   0.379797
10  10.136544   0.428597


sp500
     lb_stat  lb_pvalue
1   0.265940   0.606068
2   0.338502   0.844297
3   0.723513   0.867660
4   2.604584   0.626011
5   2.995088   0.700743
6   3.297182   0.770728
7   3.396420   0.846072
8   5.970563   0.650529
9   6.945731   0.642770
10  8.531878   0.577029


walmex
      lb_stat  lb_pvalue
1    0.016821   0.896808
2    0.530911   0.766857
3    1.258959   0.738901
4    3.589527   0.464397
5    4.365721   0.498048
6    5.651316   0.463360
7    9.178720   0.240078
8    9.653880   0.290164
9    9.935129   0.355762
10  10.193264   0.423704


For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the VARMA model. Hence, the residuals are independently distributed and the model can be used for forecasting.

Predictions¶

The model was then used to forecast the features in the dataset, using the length of the testing set.

In [832]:
# Forecasting
X_pred_varma_model = varma_model.get_prediction(start=0, end=len(X_test) - 1).predicted_mean.values
X_pred_varma_model
Out[832]:
array([[ 2.63279862e+01,  9.43190298e+01,  1.44490053e+00],
       [ 3.86450775e+01,  5.98082098e+01,  1.23995013e+00],
       [ 2.25953637e+01,  8.17870839e+01,  2.24825348e-01],
       [ 1.47676179e+00,  1.30564027e+02,  1.87963976e+00],
       [ 4.07761480e+01,  8.01085645e+01,  2.82799959e+00],
       [ 4.07791124e+01,  3.81162321e+01,  1.19670208e-01],
       [ 3.11122408e+01,  3.07986560e+01, -9.63151577e-01],
       [ 2.76706732e+01,  1.35214035e+01,  1.65795496e-01]])
In [833]:
# Bringing predictions to original scale
for i in range(0, len(X_model.columns)):
    X_pred_varma_model[:,i] = X_model[:cutoff+1].values[-1,i] + X_pred_varma_model[:,i].cumsum()
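The cumulative sum above undoes the first-order differencing: adding the running total of predicted differences to the last observed level recovers the original scale. A toy check of the identity:

```python
import numpy as np

series = np.array([100.0, 103.0, 101.0, 106.0])
diffs = np.diff(series)            # first-order differences: [3., -2., 5.]

# Last known level before the forecast window plus cumulative differences
reconstructed = series[0] + diffs.cumsum()
print(reconstructed)               # recovers [103. 101. 106.]
```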
In [834]:
# Converting array of predictions into a dataframe
features_names = X_model.columns
index_labels = X_diff_model[cutoff:].index
X_pred_varma_model = pd.DataFrame(data=X_pred_varma_model, index=index_labels, columns=features_names)
X_pred_varma_model
Out[834]:
units sp500 walmex
date
2022-01-01 3646.327986 4694.666119 74.433126
2022-04-01 3684.973064 4754.474329 75.673076
2022-07-01 3707.568427 4836.261413 75.897902
2022-10-01 3709.045189 4966.825440 77.777542
2023-01-01 3749.821337 5046.934004 80.605541
2023-04-01 3790.600450 5085.050236 80.725211
2023-07-01 3821.712690 5115.848892 79.762060
2023-10-01 3849.383364 5129.370296 79.927855

Model Assessment¶

The predictions from the VARMA model were plotted against the time series to visualize their accuracy.

In [835]:
plot_multivariate_predictions_in_single_chart(train=X_model[:cutoff + 2],
                                            test=X_model[cutoff + 1:],
                                            predictions=X_pred_varma_model,
                                            plot_title='VARMA Model')
In [836]:
plot_multivariate_predictions_in_multiple_charts(train=X_model[:cutoff + 2], 
                                                test=X_model[cutoff + 1:], 
                                                predictions=X_pred_varma_model)

In view of the plots above, it can be seen that the VARMA model was not able to yield accurate predictions. The only series whose predictions were reasonably accurate was units. However, it is important to bear in mind that the COVID pandemic disrupted historical trends not only in Mexico but worldwide, so it was hardly realistic to expect the VARMA model to provide accurate predictions for the years 2021-2023 using data from 2014-2020 for training. Overall, the performance of the VARMA model seems better than that of the VAR model.

The RMSE, MAE, and $\bf{r^{2}}$ score were calculated for each feature as follows:
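`calculate_scores` is a helper defined earlier in the notebook; an equivalent per-column scorer might look like the following sketch (the name `score_columns` and its signature are illustrative, not the notebook's actual implementation):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def score_columns(predictions: pd.DataFrame, actuals: pd.DataFrame):
    """Return per-column RMSE, MAE, and r2 scores as lists (illustrative helper)."""
    rmse, mae, r2 = [], [], []
    for col in actuals.columns:
        rmse.append(np.sqrt(mean_squared_error(actuals[col], predictions[col])))
        mae.append(mean_absolute_error(actuals[col], predictions[col]))
        r2.append(r2_score(actuals[col], predictions[col]))
    return rmse, mae, r2

actuals = pd.DataFrame({'units': [10.0, 12.0, 11.0]})
preds = pd.DataFrame({'units': [9.0, 13.0, 11.0]})
rmse, mae, r2 = score_columns(preds, actuals)
```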

In [837]:
varma_rmse, varma_mae, varma_r2 = calculate_scores(predictions=X_pred_varma_model, actuals=X_model[cutoff + 1:])
units
RMSE: 29.895
MAE: 26.242
Coefficient of Determination: 0.871

sp500
RMSE: 806.357
MAE: 763.090
Coefficient of Determination: -11.235

walmex
RMSE: 8.352
MAE: 7.041
Coefficient of Determination: -8.852

In [838]:
# Adding scores to lists
X_rmse_list.append(varma_rmse)
X_mae_list.append(varma_mae)
X_r2_list.append(varma_r2)

6.13 Vector Autoregressive Integrated Moving Average (VARIMA) Model ¶

Modeling Technique ¶

As shown above, the original dataset comprised $40$ historical data points about $7$ features (economic indicators) and $1$ response variable (net sales of WALMEX in millions of MXN). However, based on the results from Johansen's cointegration test, only the series units, S&P 500, and WALMEX shared the same underlying stochastic trend closely enough to be used in a multivariate time series model.

In this sense, a third model for multivariate time series prediction was built using a Vector Autoregressive Integrated Moving Average (VARIMA) model for forecasting the $3$ selected features. A VARIMA model was selected as a generalization of the VARMA model that includes an integration process (Peixeiro, 2022) for forecasting non-stationary time series, i.e., series that have a trend and/or a non-constant variance.

The VARIMA($p$, $d$, $q$) model describes the relationship among two or more non-stationary time series and the impact that their past values and past error terms have on each other (Peixeiro, 2022). Similar to the VARMA model, the VARIMA($p$, $d$, $q$) model is a generalization of the ARMA model for multiple non-stationary time series, where the order $p$ determines how many lagged values impact the present value of the different time series, the order $d$ indicates the number of times the time series have been differenced to become stationary, and the order $q$ determines the number of past error terms that affect their present values.

The class VARIMA from the Darts library in Python was used to build the VARIMA model (Herzen et al., 2022).

As the implementation of the VARIMA model in Darts does not support the prediction of likelihood parameters, the Akaike Information Criterion (AIC) could not be computed for different sets of $p$ and $q$ values. So, the same orders $p=1$ and $q=1$ obtained during the optimization of the VARMA model were used, yielding a VARIMA($1$, $1$, $1$).

Then, the Granger causality test was applied to determine whether past values of a time series are statistically significant in forecasting another time series, and a residual analysis was finally carried out to assess whether the residuals were normally and independently distributed (Peixeiro, 2022).

Modeling Assumptions ¶

It is assumed that the features are cointegrated, i.e., the different features share an underlying stochastic trend, and that the linear combination of the different features is stationary (Clower, 2020).

Moreover, it was assumed that the current values are linearly dependent on their past values and past error terms (Peixeiro, 2022), the time series are not a random walk, and seasonality effects were not relevant for modeling.

Test Design ¶

The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [839]:
selected_features = ['units', 'sp500', 'walmex']

X_model = X[selected_features]

cutoff = int(0.8*(len(X_model)))

X_train = X_model[:cutoff]
X_test = X_model[cutoff:]

Model Building¶

Later, the VARIMA model was built using the class VARIMA from the Darts library.

In [840]:
def fit_varima_model(p: int, d: int, q: int, time_series: pd.DataFrame) -> VARIMA:
        """
        Fits a VARIMA model given a set of p, d, and q values.

        Parameters:
        p (int): Order of the autoregressive process in the model.
        d (int): Integration order in the model.
        q (int): Order of the moving average process in the model.
        time_series (pandas.DataFrame): Data for the time series for fitting the model.

        Returns:
        varima_model (darts.models.forecasting.varima.VARIMA): A VARIMA object fitted according to the combination of p, d and q.


        """
        # Converting pandas.DataFrame to Darts.TimeSeries
        time_series = TimeSeries.from_dataframe(time_series)

        # Fitting VARIMA model
        varima_model = VARIMA(p=p, d=d, q=q, trend="n") # No trend for models with integration
        varima_model.fit(time_series)

        return varima_model
In [841]:
varima_model = fit_varima_model(p=1, d=1, q=1, time_series=X_train)

Granger Causality Test¶

The Granger causality test was applied to determine whether past values of a time series are statistically significant in forecasting another time series (Peixeiro, 2022). This test is unidirectional, so it has to be applied in both directions for each pair of series.

$$H_{0}: y_{2,t}\text{ does not Granger-cause }y_{1,t}$$$$H_{1}: y_{2,t}\text{ Granger-causes }y_{1,t}$$$$\alpha = 0.05$$
In [842]:
gc_summary = test_granger_causality(df=X_model, lag=1, alpha=0.05)
gc_summary
sp500 --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=2.7736  , p=0.1045  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=3.0048  , p=0.0830  , df=1
likelihood ratio test: chi2=2.8946  , p=0.0889  , df=1
parameter F test:         F=2.7736  , p=0.1045  , df_denom=36, df_num=1

sp500 does not Granger-causes units.


units --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=2.2995  , p=0.1382  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=2.4911  , p=0.1145  , df=1
likelihood ratio test: chi2=2.4148  , p=0.1202  , df=1
parameter F test:         F=2.2995  , p=0.1382  , df_denom=36, df_num=1

units does not Granger-causes sp500.


walmex --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6690  , p=0.2046  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=1.8081  , p=0.1787  , df=1
likelihood ratio test: chi2=1.7674  , p=0.1837  , df=1
parameter F test:         F=1.6690  , p=0.2046  , df_denom=36, df_num=1

walmex does not Granger-causes units.


units --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6634  , p=0.2054  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=1.8021  , p=0.1795  , df=1
likelihood ratio test: chi2=1.7617  , p=0.1844  , df=1
parameter F test:         F=1.6634  , p=0.2054  , df_denom=36, df_num=1

units does not Granger-causes walmex.


walmex --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.8685  , p=0.3576  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=0.9409  , p=0.3320  , df=1
likelihood ratio test: chi2=0.9297  , p=0.3349  , df=1
parameter F test:         F=0.8685  , p=0.3576  , df_denom=36, df_num=1

walmex does not Granger-causes sp500.


sp500 --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=10.9697 , p=0.0021  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=11.8839 , p=0.0006  , df=1
likelihood ratio test: chi2=10.3734 , p=0.0013  , df=1
parameter F test:         F=10.9697 , p=0.0021  , df_denom=36, df_num=1

sp500 Granger-causes walmex.


Out[842]:
y2 y1 Granger Causality Test Interpretation SSR F-test p-values SSR Chi2-test p-values LR-test p-values Params F-test p-values
0 sp500 units False sp500 does not Granger-causes units. 0.1045 0.0830 0.0889 0.1045
1 units sp500 False units does not Granger-causes sp500. 0.1382 0.1145 0.1202 0.1382
2 walmex units False walmex does not Granger-causes units. 0.2046 0.1787 0.1837 0.2046
3 units walmex False units does not Granger-causes walmex. 0.2054 0.1795 0.1844 0.2054
4 walmex sp500 False walmex does not Granger-causes sp500. 0.3576 0.3320 0.3349 0.3576
5 sp500 walmex True sp500 Granger-causes walmex. 0.0021 0.0006 0.0013 0.0021

Even though the results from the Granger causality tests suggested that the null hypothesis cannot be rejected for most pairs (only sp500 was found to Granger-cause walmex), which casts doubt on the validity of the VARIMA model, it was decided to go further with the model.

Residual Analysis¶

The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).

To do so, the residual analysis is performed in two steps (Peixeiro, 2022):

  • Quantile-quantile plot (Q-Q plot): To qualitatively assess whether the residuals from the model are normally distributed.
  • Ljung-Box test: To test whether the residuals from the model are uncorrelated.
In [843]:
# Getting residuals
residuals = varima_model.residuals(TimeSeries.from_dataframe(X_train), retrain=False, start=0).pd_dataframe()
residuals
`start` value `0` corresponding to timestamp `2014-01-01 00:00:00` is before the first predictable/trainable historical forecasting point for series at index: 0. Ignoring `start` for this series and beginning at first trainable/predictable time: 2021-07-01 00:00:00. To hide these warnings, set `show_warnings=False`.
Out[843]:
component units sp500 walmex
time
2021-07-01 17.695093 42.250966 2.757533
2021-10-01 16.102259 22.015971 -1.820832

However, as the implementation of the VARIMA model in Darts only yielded the residuals for the last two observations, the residual analysis could not be performed properly.

Thus, the residuals were only assessed by testing for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:

$$H_{0}: \text{No autocorrelation}$$$$H_{1}: \text{Autocorrelation}$$$$\alpha = 0.05$$
In [844]:
# Ljung-Box test of the residuals on 10 lags
cols = X_model.columns.values

for i in range(0, len(cols)):
    print(cols[i])
    print(acorr_ljungbox(residuals.values[:,i], np.arange(1, 3)))
    print("\n")
units
   lb_stat  lb_pvalue
1      2.0   0.157299
2      inf   0.000000


sp500
   lb_stat  lb_pvalue
1      2.0   0.157299
2      inf   0.000000


walmex
   lb_stat  lb_pvalue
1      2.0   0.157299
2      inf   0.000000


For lag 1, the $p\text{-}values$ were above $0.05$ (with only two residuals available, the statistic at lag 2 is degenerate). Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the VARIMA model. Hence, in theory, the residuals are independently distributed and the model could be used for forecasting.

Of course, strictly speaking, as the residual analysis could not be properly performed, it is not possible to know whether the model is valid or not. However, for the purposes of the present study, the model was still used for generating predictions.

Predictions¶

The model was then used to forecast the features in the dataset, using the length of the testing set.

In [845]:
X_pred_varima_model = varima_model.predict(X_test.shape[0])
X_pred_varima_model = X_pred_varima_model.pd_dataframe()
X_pred_varima_model
Out[845]:
component units sp500 walmex
date
2022-01-01 3651.938437 4750.952164 75.167275
2022-04-01 3669.012984 4884.226400 78.487810
2022-07-01 3694.286642 4991.645860 81.403402
2022-10-01 3714.224177 5082.650012 83.158005
2023-01-01 3727.063879 5161.802024 84.974235
2023-04-01 3740.701532 5227.447196 86.644681
2023-07-01 3752.511366 5282.648505 87.837714
2023-10-01 3761.117644 5330.036183 88.904245

Model Assessment¶

The predictions from the VARIMA model were plotted against the time series to visualize their accuracy.

In [846]:
plot_multivariate_predictions_in_single_chart(train=X_model[:cutoff+1],
                                            test=X_model[cutoff:],
                                            predictions=X_pred_varima_model,
                                            plot_title='VARIMA Model')
In [847]:
plot_multivariate_predictions_in_multiple_charts(train=X_model[:cutoff + 1], 
                                                test=X_model[cutoff:], 
                                                predictions=X_pred_varima_model)

In view of the plots above, it can be seen that the VARIMA model was not able to yield accurate predictions. The only series whose predictions were more or less accurate was units. However, it is important to bear in mind that the COVID pandemic disrupted historical trends not only in Mexico but worldwide, so it was hardly realistic to expect the VARIMA model to provide accurate predictions for the years 2021-2023 using data from 2014-2020 for training. Notwithstanding the above, the performance of the VARIMA model seems slightly worse than that of the VARMA model.

The RMSE, MAE, and $\bf{r^{2}}$ score were calculated for each feature as follows:

In [848]:
varima_rmse, varima_mae, varima_r2 = calculate_scores(predictions=X_pred_varima_model, actuals=X_model[cutoff:])
units
RMSE: 57.604
MAE: 42.577
Coefficient of Determination: 0.523

sp500
RMSE: 939.278
MAE: 898.338
Coefficient of Determination: -15.601

walmex
RMSE: 14.040
MAE: 12.263
Coefficient of Determination: -26.838

In [849]:
# Adding scores to lists
X_rmse_list.append(varima_rmse)
X_mae_list.append(varima_mae)
X_r2_list.append(varima_r2)

6.14 Random Forest (RF) Regression Model ¶

Modeling Technique ¶

As it was desired to predict a target numeric value, a regression model was built for predicting the net sales of WALMEX based on the forecasts for the $3$ features selected from the multivariate time series models: units, S&P 500, and WALMEX stock value.

It was decided to use a Random Forest (RF) approach, as this algorithm generally provides better models than single decision trees and makes it easy to identify the most important features in a model (Géron, 2019). Moreover, it is not necessary to comply with the assumptions of linear regression or to perform feature scaling.

Modeling Assumptions ¶

The main assumption is that the net sales of WALMEX can be predicted as a function of the economic indicators of Mexico and USA.

Furthermore, it was assumed that the relationship between the predictors and the response was not strictly linear; thus, a linear regression approach was discarded.

Test Design ¶

The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [850]:
selected_features = ['units', 'sp500', 'walmex']

X_model = X[selected_features]

X_train, X_test, y_train, y_test = train_test_split(X_model, y, test_size = 0.2, random_state = 0)

Model Building¶

The random forest model was built using the RandomForestRegressor class from the scikit-learn library.

In [851]:
rf_model = RandomForestRegressor(n_estimators = 500, random_state = 0)
rf_model.fit(X_train, y_train)
Out[851]:
RandomForestRegressor(n_estimators=500, random_state=0)

Predictions¶

The model was then used to predict the response variable using the features from the testing set.

In [852]:
y_pred_rf_model = rf_model.predict(X_test)

# Transforming predictions into a Pandas series
y_pred_rf_model = pd.DataFrame(y_pred_rf_model, index=X_test.index, columns=['net_sales'])
y_pred_rf_model
Out[852]:
net_sales
date
2019-07-01 167946.916
2019-01-01 164339.552
2020-04-01 167019.450
2015-01-01 119220.126
2016-07-01 126007.214
2017-10-01 144273.058
2021-01-01 186783.454
2016-10-01 124694.170

Feature Importances¶

Then, the importance of each feature was retrieved using the Mean Decrease in Impurity criterion:

In [853]:
importances = rf_model.feature_importances_
importances
Out[853]:
array([0.67663784, 0.14493053, 0.17843163])
In [854]:
features_names = X_model.columns
importances_series = pd.Series(importances, index=features_names)
chart_title = "Feature Importances in Random Forest Regression Model"

fig, ax = plt.subplots(figsize=(8,5))
importances_series.sort_values(ascending=False).plot.bar(ax=ax)
plt.title(chart_title)
plt.xlabel("Features")
plt.ylabel("Mean Decrease in Impurity")
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
plt.show()

Thus, in view of the plot above, the feature Units was the most important in the model.

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the regression model.

In [855]:
def plot_predictions_from_regression_model(predictions: pd.DataFrame, time_series: pd.DataFrame = y, chart_title: str = 'Predictions from Regression Model') -> None:
        """
        Plots the predictions from a regression model compared to a historical time series.

        Parameters:
        predictions (pandas.DataFrame): Predictions from the regression model with a DateTime index.
        time_series (pandas.DataFrame): Historical time series with a DateTime index.
        chart_title (str): Title for the chart.

        Returns:
        None
        """

        # Colors
        time_series_color = sns.color_palette('Blues_r')[0]
        pred_color = '#C70039'

        # If predictions and time_series are DataFrames, select the first column
        if isinstance(predictions, pd.DataFrame):
                predictions = predictions.iloc[:, 0]
        if isinstance(time_series, pd.DataFrame):
                time_series = time_series.iloc[:, 0]

        # Plots
        fig, ax = plt.subplots(figsize=(8,5))
        sns.lineplot(x=time_series.index, y=time_series.values, ax=ax, color=time_series_color, zorder=1)
        sns.scatterplot(x=predictions.index, y=predictions.values, ax=ax, color = pred_color, zorder=2)

        # Labels
        plt.title(chart_title)
        plt.xlabel("Date")
        plt.ylabel("Net Sales (mdp)")

        # Adding legend to plot
        legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
                        Line2D([0], [0], color=pred_color, lw=2, linestyle='None', marker="o")]
        plt.legend(legend_lines, ['Time Series', 'Predictions'], loc='upper left', facecolor='white', frameon = True)

        # Adjusting Y ticks to Currency format
        ticks = ax.get_yticks()
        ax.set_yticks(ticks)  # fix tick positions before relabeling to avoid a Matplotlib warning
        ax.set_yticklabels([f'${int(i):,.0f}' for i in ticks])

        plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
        plt.show()
In [856]:
plot_predictions_from_regression_model(predictions=y_pred_rf_model,
                                        chart_title="Predictions from RF Model VS. WALMEX Historical Net Sales")

From the plot above, it can be seen that the predictions are very close to the historical time series. Thus, the Random Forest regression model was able to capture the most important features for predicting the net sales of WALMEX.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated for the response variable as follows:

In [857]:
rf_rmse, rf_mae, rf_r2 = calculate_scores(pd.DataFrame(y_test), y_pred_rf_model)
net_sales
RMSE: 16256.433
MAE: 12380.448
Coefficient of Determination: 0.516

In [858]:
# Appending values to scores lists
reg_rmse_list.append(rf_rmse[0])
reg_mae_list.append(rf_mae[0])
reg_r2_list.append(rf_r2[0])

Several numbers of trees were tested, and the following results were obtained:

| Trees | RMSE | MAE | $r^{2}$ |
|---|---|---|---|
| 50 | 16465.794 | 13240.795 | 0.504 |
| 100 | 16464.343 | 12533.214 | 0.513 |
| 500 | 16256.433 | 12380.448 | 0.516 |

Thus, based on their performance, the model selected was the RF model with 500 trees.
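A sweep over the number of trees of this kind can be reproduced with a simple loop (sketched here on synthetic data rather than the notebook's X_train and y_train):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the notebook's features/response
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit one forest per candidate tree count and score it on the held-out set
scores = {}
for n_trees in (50, 100, 500):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    scores[n_trees] = mean_absolute_error(y_te, model.predict(X_te))

print(scores)
```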

6.15 Support Vector Regression (SVR) Model ¶

Modeling Technique ¶

Likewise, another regression model was built for predicting the net sales of WALMEX, based on the forecasts of the three features selected from the multivariate time series models: units, S&P 500, and WALMEX stock value.

A Support Vector Regression (SVR) approach was selected, as this algorithm is less restrictive in its assumptions than linear regression: it is more robust to outliers and does not strictly assume a linear relationship between the independent and dependent variables, even when using a linear kernel (Géron, 2019).
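The robustness mentioned above comes from SVR's epsilon-insensitive loss: residuals inside the epsilon tube cost nothing, and larger deviations are only penalized linearly. A hand-rolled illustration of that loss (not scikit-learn's internal code):

```python
import numpy as np

def epsilon_insensitive(residuals: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Zero loss inside the epsilon tube, linear loss outside of it."""
    return np.maximum(0.0, np.abs(residuals) - epsilon)

# Residuals within +/- epsilon contribute no loss at all
print(epsilon_insensitive(np.array([0.05, -0.08, 0.3, -1.1])))
# → [0.  0.  0.2 1. ]
```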

Modeling Assumptions ¶

The main assumption is that the net sales of WALMEX can be predicted as a function of the economic indicators of Mexico and USA.

Also, a strict linear relationship between the variables was not assumed, nor were the requirements of normally distributed residuals, no multicollinearity, and homoscedasticity.

Test Design ¶

Firstly, according to the cointegration test, the data was sliced to keep only the features units, S&P 500, and WALMEX stock value.

In [859]:
selected_features = ['units', 'sp500', 'walmex']

X_model = X[selected_features]

Then, the dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.

In [860]:
X_train, X_test, y_train, y_test = train_test_split(X_model, y, test_size = 0.2, random_state = 0)

Model Building¶

Firstly, the data was standardized with the StandardScaler class from scikit-learn.

In [861]:
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

Then, the SVR model was built using the SVR class from the scikit-learn library.

In [862]:
svr_model = SVR(kernel='linear', C=1000.0, epsilon=0.1).fit(X_train_sc, y_train)

Predictions¶

The model was then used to predict the response variable using the features from the testing set.

In [863]:
y_pred_svr_model = svr_model.predict(X_test_sc)

# Transforming predictions into a Pandas dataframe
y_pred_svr_model = pd.DataFrame(y_pred_svr_model, index=X_test.index, columns=['net_sales'])
y_pred_svr_model
Out[863]:
net_sales
date
2019-07-01 164807.934413
2019-01-01 151221.875855
2020-04-01 163867.176807
2015-01-01 123791.350612
2016-07-01 130167.111614
2017-10-01 141331.989989
2021-01-01 179614.339687
2016-10-01 129726.974562

Model Assessment¶

The predictions were plotted against the historical net sales data to visually assess the performance of the regression model.

In [864]:
plot_predictions_from_regression_model(predictions=y_pred_svr_model,
                                    chart_title="Predictions from SVR Model VS. WALMEX Historical Net Sales")

From the plot above, it can be seen that the predictions are close to the historical time series. Thus, the SVR model was able to capture the most important features for predicting the net sales of WALMEX.

Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:

In [865]:
svr_rmse, svr_mae, svr_r2 = calculate_scores(pd.DataFrame(y_test), y_pred_svr_model)
net_sales
RMSE: 14535.582
MAE: 10676.184
Coefficient of Determination: 0.402

In [866]:
# Appending values to scores lists
reg_rmse_list.append(svr_rmse[0])
reg_mae_list.append(svr_mae[0])
reg_r2_list.append(svr_r2[0])

Several values of the parameter C were tested, and the following results were obtained:

C           Epsilon    RMSE         MAE          $r^{2}$
1.0         0.1        21540.030    15545.984    -176851.575
100.0       0.1        17257.012    13353.768    -10.351
1000.0      0.1        14535.582    10676.184    0.402
5000.0      0.1        16414.855    11640.130    0.470
10000.0     0.1        16223.993    11624.612    0.475
100000.0    0.1        18720.282    14280.358    0.465

Thus, based on their performance, the model selected was the SVR model with $C=1000.0$, which yielded the lowest RMSE and MAE.
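The C sweep can be scripted in the same fashion. The sketch below uses a Pipeline so the scaler is refit cleanly for each candidate value, with synthetic data standing in for the WALMEX split:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the WALMEX training/testing split
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(32, 3)); y_tr = X_tr @ np.array([1.0, -0.5, 2.0])
X_te = rng.normal(size=(8, 3));  y_te = X_te @ np.array([1.0, -0.5, 2.0])

rmse_by_c = {}
for C in (1.0, 100.0, 1000.0, 5000.0, 10000.0, 100000.0):
    # Scaling and fitting happen together, so no scaler state leaks between runs
    model = make_pipeline(StandardScaler(), SVR(kernel='linear', C=C, epsilon=0.1))
    model.fit(X_tr, y_tr)
    rmse_by_c[C] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

best_C = min(rmse_by_c, key=rmse_by_c.get)
print(best_C, rmse_by_c[best_C])
```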


7. Evaluation¶


In this section, the research question was answered: Will Walmart of Mexico be able to double its sales within the next ten years? If so, when?

Firstly, the models were ranked according to their performance, and the best univariate, multivariate, and regression models were selected and retrained using all the historical data.

Then, the WALMEX net sales forecasts were estimated using the two approaches defined in this study:

  1. Best univariate time series model.
  2. Best multivariate time series model coupled with the best regression model.

In both cases, the net sales were forecasted over a period of ten years (2024Q1-2033Q4).

7.1 Models Ranking ¶

According to the assessment performed for each model, the models were ranked in terms of the RMSE, MAE, and $r^{2}$ scores.

In [867]:
# Creating Pandas Dataframe from Score Lists of Univariate Time Series Models
univariate_scores_dict = {'Model': univariate_models_list,'RMSE': rmse_list, 'MAE': mae_list, 'Coeff. of Determination': r2_list} 
univariate_scores_df = pd.DataFrame(univariate_scores_dict)
univariate_scores_df.sort_values(by='RMSE', ascending=True)
Out[867]:
Model RMSE MAE Coeff. of Determination
5 SARIMA 2675.576039 2372.531418 0.983446
3 ARMA 5006.672720 4190.297162 0.942035
1 AR 7687.806004 6558.916411 0.863330
8 HW 9508.312432 7645.737108 0.790938
2 AR w/Additive Decomp. 11272.203504 8970.403291 0.706176
9 Prophet 12323.068089 10710.708420 0.648839
0 MA 15216.900678 9651.900000 0.464547
6 SARIMAX 19736.315360 18619.001507 0.099256
4 ARIMA 27274.027756 24120.852698 -0.720156
7 SES 180107.784220 166655.345652 -74.012589
In [868]:
# Best performing univariate time series model in terms of RMSE
univariate_scores_df.loc[univariate_scores_df['RMSE'] == univariate_scores_df['RMSE'].min(), :]
Out[868]:
Model RMSE MAE Coeff. of Determination
5 SARIMA 2675.576039 2372.531418 0.983446
In [869]:
# Best performing univariate time series model in terms of MAE
univariate_scores_df.loc[univariate_scores_df['MAE'] == univariate_scores_df['MAE'].min(), :]
Out[869]:
Model RMSE MAE Coeff. of Determination
5 SARIMA 2675.576039 2372.531418 0.983446

Thus, in view of the results above, the $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model was the univariate time series model with the best performance in terms of the RMSE, MAE, and $r^{2}$ scores.

In the case of the multivariate models, the scores are as follows:

In [870]:
# Creating Pandas Dataframe from Score Lists of Multivariate Time Series Models
multivariate_scores_dict = {'Model': multivariate_models_list,'RMSE': X_rmse_list, 'MAE': X_mae_list, 'Coeff. of Determination': X_r2_list} 
multivariate_scores_df = pd.DataFrame(multivariate_scores_dict)
multivariate_scores_df.sort_values(by='RMSE', ascending=True)
Out[870]:
Model RMSE MAE Coeff. of Determination
1 VARMA [29.895275196270124, 806.357115051538, 8.35249... [26.241590913676305, 763.0903497943536, 7.0409... [0.8714382007640038, -11.235244127649448, -8.8...
0 VAR [41.68363399538274, 741.6975334422666, 7.84074... [30.27016894936554, 699.7044134829706, 6.13388... [0.7500587264511199, -9.351694912545407, -7.68...
2 VARIMA [57.60360726976815, 939.2783902259093, 14.0402... [42.57743313428142, 898.3375517962176, 12.2627... [0.522683780550852, -15.601466594952402, -26.8...

Thus, in view of the results above, the $\text{VARMA}(1, 1)$ model was the multivariate time series model with the best performance in terms of the RMSE, MAE, and $r^{2}$ scores.

Finally, the scores from the regression models are shown below:

In [871]:
# Creating Pandas Dataframe from Score Lists of Regression Models
regression_scores_dict = {'Model': regression_models_list,
                          'RMSE': reg_rmse_list, 
                          'MAE': reg_mae_list, 
                          'Coeff. of Determination': reg_r2_list} 
regression_scores_df = pd.DataFrame(regression_scores_dict)
regression_scores_df.sort_values(by='RMSE', ascending=True)
Out[871]:
Model RMSE MAE Coeff. of Determination
1 SVR 14535.581695 10676.183853 0.401998
0 RF 16256.433470 12380.448000 0.515834

Overall, both regression models showed very similar performance. However, as the RF model may be forced to predict net sales values far outside its training range, the SVR model was selected as the best performing one.
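The extrapolation argument can be demonstrated on a toy trend (an illustration added here, not part of the original notebook): a random forest can only average target values it has already seen, so far beyond the training range its prediction saturates near the training maximum, while a linear SVR keeps following the trend.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Toy upward trend: y = 100 * x on x in [0, 50)
X_tr = np.arange(0, 50, dtype=float).reshape(-1, 1)
y_tr = 100.0 * X_tr.ravel()

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
svr = SVR(kernel='linear', C=1000.0, epsilon=0.1).fit(X_tr, y_tr)

x_future = np.array([[100.0]])        # far beyond the training range
rf_pred = rf.predict(x_future)[0]     # saturates near max(y_tr) = 4900
svr_pred = svr.predict(x_future)[0]   # keeps following the linear trend
print(rf_pred, svr_pred)
```

Since net sales trend upward, a forecast ten years out sits well outside the training range, which is exactly the regime where the forest's predictions flatten.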

7.2 Best Univariate Time Series Model ¶

The $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model, which was the best performing univariate time series model, was retrained with all the available data to, then, generate the forecasts of the WALMEX net sales over the next ten years.

In [872]:
univariate_model = optimize_sarima_model(p=[1], 
                                        d=1, 
                                        q=[1], 
                                        P=[1], 
                                        D=1, 
                                        Q=[1], 
                                        m=4, 
                                        time_series=y)
The best model is (p = 1, d = 1, q = 1)(P = 1, D = 1, Q = 1)(m = 4), with an AIC of 709.74.

         (p,d,q)(P,D,Q)m         AIC
0  (1, 1, 1, 1, 1, 1, 4)  709.737995

The predictions were estimated as follows:

In [873]:
forecasted_years = 10
y_pred_univariate_model = univariate_model.get_prediction(start=len(y), end=(len(y)+(forecasted_years*4))).predicted_mean
y_pred_univariate_model = pd.DataFrame(y_pred_univariate_model).rename(columns={'predicted_mean':'net_sales'})
y_pred_univariate_model.index.name = 'date'
y_pred_univariate_model.tail()
Out[873]:
net_sales
date
2033-01-01 375846.239256
2033-04-01 383747.990764
2033-07-01 383201.395711
2033-10-01 424049.659129
2034-01-01 393491.468211

7.3 Best Multivariate Time Series Model ¶

Then, the $\text{VARMA}(1,1)$ model, which was the best performing multivariate time series model, was retrained with all the available data to generate the forecasts of the predictors units, S&P500, and WALMEX stock value, over the next ten years.

In turn, the above forecasts were used to predict the WALMEX net sales with the SVR model.

In [874]:
X_model = X[['units', 'sp500', 'walmex']]
X_model_diff = X_diff[['units', 'sp500', 'walmex']]

multivariate_model = optimize_varma_model(p=[1], q=[1], time_series=X_model_diff)
The best model is (p = 1, q = 1), with an AIC of 1130.01.

    (p,q)          AIC
0  (1, 1)  1130.007552

Then, the predictions from the VARMA model were obtained.

In [875]:
forecasted_years = 10
X_pred_multivariate_model = multivariate_model.get_prediction(start=0, end=(forecasted_years*4)).predicted_mean.values
In [876]:
# Bringing predictions to original scale
for i in range(0, len(X_model.columns)):
    X_pred_multivariate_model[:,i] = X_model.values[-1,i] + X_pred_multivariate_model[:,i].cumsum()
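The rescaling step above follows the standard rule for undoing first differencing: the forecasted differences are cumulatively summed and anchored at the last observed level. A minimal numeric illustration:

```python
import numpy as np
import pandas as pd

# Toy series of levels; a model fit on its first differences
# is assumed to forecast the next three differences below.
levels = pd.Series([100.0, 103.0, 101.0, 106.0])
future_diffs = np.array([2.0, -1.0, 4.0])

# Levels are recovered as: last observed level + cumulative sum of differences
future_levels = levels.iloc[-1] + future_diffs.cumsum()
print(future_levels)  # → [108. 107. 111.]
```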
In [877]:
# Converting array of predictions into a dataframe
features_names = X_model.columns
index_labels = y_pred_univariate_model.index
X_pred_multivariate_model = pd.DataFrame(data=X_pred_multivariate_model, index=index_labels, columns=features_names)
X_pred_multivariate_model.tail()
Out[877]:
units sp500 walmex
date
2033-01-01 5013.311546 6873.751990 98.957488
2033-04-01 5042.021394 7040.376839 100.593166
2033-07-01 5069.858295 7243.914734 102.655605
2033-10-01 5082.205722 7364.412426 106.121013
2034-01-01 5115.372305 7429.510337 108.052018
In [878]:
plot_multivariate_predictions_in_single_chart(train=X_model, 
                                            test=X_model, 
                                            predictions=X_pred_multivariate_model, 
                                            plot_title='Predictors Forecasts From VARMA Model')

Later, the SVR model was also retrained with all the historical data.

In [879]:
X_sc = sc.transform(X[['units', 'sp500', 'walmex']])
In [880]:
regressor = SVR(kernel='linear', C=1000, epsilon=0.1)
regressor.fit(X_sc, y)
Out[880]:
SVR(C=1000, kernel='linear')

Finally, the predictions for the WALMEX net sales were obtained from the regression model when using the forecasts for the predictors from the VARMA model.

In [881]:
y_pred_multivariate_model = regressor.predict(sc.transform(X_pred_multivariate_model))
y_pred_multivariate_model
Out[881]:
array([206251.94427811, 208882.66509991, 210690.84116724, 212791.10223392,
       216495.10085519, 218276.31331256, 219481.22773762, 221351.32599743,
       221858.30564074, 224353.24440028, 227774.69287867, 229968.45027504,
       234232.26088527, 237652.34315964, 239058.23108442, 242502.62858522,
       245262.6805464 , 248014.97532063, 249600.58095426, 250853.39336929,
       252746.32868685, 256356.59550893, 258061.41293436, 260588.72123067,
       263182.86719354, 264808.35944326, 267801.66801255, 273429.23789836,
       278356.25264569, 281269.45989441, 286172.83336491, 289033.89837456,
       290625.45742746, 290189.64514756, 290438.25998121, 291808.95300327,
       293072.05409541, 296861.51496584, 301230.56186466, 305038.88432865,
       308231.33496768])
In [882]:
# Converting predictions into a dataframe
y_pred_multivariate_model = pd.DataFrame(y_pred_multivariate_model, columns=['net_sales']).set_index(y_pred_univariate_model.index)
y_pred_multivariate_model.head()
Out[882]:
net_sales
date
2024-01-01 206251.944278
2024-04-01 208882.665100
2024-07-01 210690.841167
2024-10-01 212791.102234
2025-01-01 216495.100855

7.4 Research Question's Solution ¶

Now, after all the modeling and evaluation, the predictions from the univariate and multivariate/regression models were assessed visually:

In [883]:
def plot_sales_forecast(time_series: pd.DataFrame, predictions: pd.DataFrame, chart_title: str) -> None:
        """
        Plots a time series and its forecast.

        Parameters:
        time_series (pandas.DataFrame): Historical time series data.
        predictions (pandas.DataFrame): Forecasts for the time series.
        chart_title (str): Title to be displayed on the chart.

        Returns:
        None
        """

         # Colors
        time_series_color = sns.color_palette('Blues_r')[0]
        pred_color = '#C70039'

        # If predictions and time_series are DataFrames, select the first column
        if isinstance(predictions, pd.DataFrame):
                predictions = predictions.iloc[:, 0]
        if isinstance(time_series, pd.DataFrame):
                time_series = time_series.iloc[:, 0]
        
        # Adjusting plots continuity
        last_value = time_series.iloc[-1]
        last_index = time_series.index[-1]
        last_observation = pd.Series(last_value, index=[last_index])
        predictions = pd.concat([predictions,last_observation]).sort_index()

        # Plots
        fig, ax = plt.subplots(figsize=(8,5))
        sns.lineplot(x=time_series.index, y=time_series.values, ax=ax, color=time_series_color, zorder=1)
        sns.lineplot(x=predictions.index, y=predictions.values, ax=ax, color = pred_color, zorder=2)

        # Adding shadow to predictions
        first_pred_x = predictions.index[0]
        last_pred_x = predictions.index[-1]
        ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)

        # Labels
        plt.title(chart_title)
        plt.xlabel("Date")
        plt.ylabel("Net Sales (mdp)")

        # Adding legend to plot
        legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
                        Line2D([0], [0], color=pred_color, lw=2)]
        plt.legend(legend_lines, ['Historical', 'Forecast'], loc='upper left', facecolor='white', frameon=True)

        # Adjusting Y ticks to Currency format
        ticks = ax.get_yticks()
        new_labels = [f'${int(i):,.0f}' for i in ticks]
        ax.set_yticklabels(new_labels)

        plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png',  bbox_inches = 'tight')
        plt.show()
In [884]:
# Forecast from the Univariate Model 
plot_sales_forecast(time_series=y, 
                    predictions=y_pred_univariate_model, 
                    chart_title='Net Sales Forecast from Univariate Model')
In [885]:
# Forecast from the Multivariate/Regression Model 
plot_sales_forecast(time_series=y, 
                    predictions=y_pred_multivariate_model, 
                    chart_title='Net Sales Forecast from Multivariate-Regression Models')

Finally, the sales in 2023Q3 were compared with the forecasts to determine when WALMEX's goal could be achieved (if at all).

In [886]:
def get_conclusion(predictions: pd.DataFrame, time_series: pd.DataFrame = y, last_date: str ='2023-07-01'):
        """
        Looks for the first prediction in which the net sales are doubled relative to the input last date and prints the result.

        Parameters:
        predictions (pandas.DataFrame): Net sales predictions data.
        time_series (pandas.DataFrame): Historical net sales data.
        last_date (str): Date whose net sales figure is used as the baseline for comparison.

        Returns:
        None
        """

        last_sales = time_series.loc[last_date]
        sales_doubled = predictions.loc[predictions.iloc[:,0] >= 2*last_sales,:]
        
        print(f'WALMEX last figure of net sales was ${float(last_sales):,.2f} (mdp) on 2023Q3.')

        try:
            # Converting to datetime
            date = pd.to_datetime(str(sales_doubled.index[0]))

            # Converting to year-quarter format
            year_quarter = date.to_period('Q')

            # Printing result
            print(f'WALMEX will double its net sales in: {str(year_quarter)}, with a forecasted figure of about: ${float(sales_doubled.values[0]):,.2f} (mdp).')
        
        except IndexError:
            print('WALMEX will not be able to double its net sales within the next ten years.')
In [887]:
get_conclusion(y_pred_univariate_model)
WALMEX last figure of net sales was $211,436.00 (mdp) on 2023Q3.
WALMEX will double its net sales in: 2033Q4, with a forecasted figure of about: $424,049.66 (mdp).
In [888]:
get_conclusion(y_pred_multivariate_model)
WALMEX last figure of net sales was $211,436.00 (mdp) on 2023Q3.
WALMEX will not be able to double its net sales within the next ten years.

Thus, in view of the results from this study, according to the $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model, WALMEX will double its net sales in 2033Q4, with a forecasted figure of about $424,049.66 (mdp).
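As a quick sanity check on that figure (an illustration added here, not part of the original notebook): doubling between 2023Q3 and 2033Q4 spans roughly 10.25 years, which implies a compound annual growth rate of about 7%, a plausible pace for a large retailer in nominal pesos.

```python
# Implied compound annual growth rate for doubling over ~10.25 years
years = 10.25
cagr = 2.0 ** (1.0 / years) - 1.0
print(f'{cagr:.1%}')  # → 7.0%
```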

On the other hand, according to the $\text{VARMA}(1,1)$ and SVR models, WALMEX will not be able to double its net sales within the next ten years, even if the trends in its number of units, the S&P 500, and the WALMEX stock value remain constant.

However, the combination of the $\text{VARMA}(1,1)$ and SVR models was not capable of capturing the seasonality of the historical net sales. Thus, the predictions from the $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model were considered more reliable.


8. Deployment¶


According to the above results, the best performing model was $\text{SARIMA}(1,1,1)(1,1,1)_{4}$, which was a univariate time series model fitted from the historical net sales data of WALMEX.

For the deployment, it was proposed to put said model into production using the Streamlit cloud services. Please click here to play with the interactive app.


9. Conclusions¶


According to the fitted $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model, Walmart of Mexico (WALMEX) will meet its goal of doubling its sales from 211,436 mdp to about 424,050 mdp in the fourth quarter of 2033, that is, in about ten years.

Moreover, the combination of a VARMA model for forecasting the economic variables and an SVR model for predicting the net sales from those forecasts proved to be unsuccessful, whereas the more traditional SARIMA approach was better able to capture the trend and seasonality of the WALMEX net sales time series.

In this context, as a future perspective, it is suggested to explore how to automate the collection of the WALMEX net sales data in order to build a model that can be retrained and yield fresh predictions each time a new data point is added to the time series.


10. References¶


  • Atwan, T. A. (2022). Time Series Analysis with Python Cookbook. Packt Publishing Ltd.
  • Brownlee, J. (2020). How to Decompose Time Series Data into Trend and Seasonality. https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
  • Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM consortium. https://api.semanticscholar.org/CorpusID:59777418
  • Clower, E. (2020). A Guide to Conducting Cointegration Tests. https://www.aptech.com/blog/a-guide-to-conducting-cointegration-tests/
  • Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (2nd ed.). O’Reilly Media, Inc.
  • Herzen, J., Lassig, F., Giuliano-Piazzetta, S., Neuer, T., Tafti, L., Raille, G., Van Pottelbergh, T., Pasieka, M., Skrodzki, A., Huguenin, N., Dumonal, M., Koscisz, J., Bader, D., Gusset, F., Benheddi, M., Williamson, C., Kosinski, M., Petrik, M. & Grosch, G. (2022). Darts: User-Friendly Modern Machine Learning for Time Series. Journal of Machine Learning Research. 23(124):1-6. http://jmlr.org/papers/v23/21-1177.html
  • Kotzé, K. (2024). Cointegration. https://github.com/EconMaett/TSA/blob/main/T13-multivariate-cointegration/14_cointegration_slides.pdf
  • Kulkarni, A. R., Shivananda, A., Kulkarni, A., & Krishnan, V. A. (2023). Time Series Algorithms Recipes: Implement Machine Learning and Deep Learning Techniques with Python. Apress Media, LLC. https://doi.org/10.1007/978-1-4842-8978-5
  • Jamieson, S. (2019). Time Series Decomposition & Prediction in Python. https://www.linkedin.com/pulse/time-series-decomposition-prediction-python-stuart-jamieson-cfa-frm
  • NIST/SEMATECH (2012). e-Handbook of Statistical Methods. https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
  • Peixeiro, M. (2022). Time Series Forecasting in Python. Manning Publications Co.
  • Rollins, J. B. (2015). Metodología Fundamental para la Ciencia de Datos. Somers: IBM Corporation. https://www.ibm.com/downloads/cas/WKK9DX51
  • Wal-Mart de México S.A.B. de C.V. (2024). Información Financiera Trimestral 4T. https://files.walmex.mx/upload/files/2023/ES/Trimestral/4T23/WALMEX_4T23_BMV.pdf
  • Wang, L. (2018). Cointegration and Pairs Trading. https://letianzj.github.io/cointegration-pairs-trading.html
In [889]:
%pip freeze > requirements.txt
Note: you may need to restart the kernel to use updated packages.